INTRODUCTION

The objective of my thesis is to identify the existence (and if so, measure the magnitude) of geographical pollution spillovers arising from the implementation in 2018 of the Madrid Central low emission zone (LEZ) through an empirical approach.

In this markdown file, I present the code used to reach the analyses and conclusions summarised in the dissertation “Beyond the Zone: Assessing Spillover Effects of Madrid Central on Air Pollution” (hereinafter “the main document”).

The report is structured as follows. Section 1 includes the pre-processing of the data (both pollution measures and information about pollution monitoring stations). Section 2 performs a descriptive analysis of pollution measurements by station, location and year, as well as of the location of stations with respect to the LEZ borders and the perimeter of the Community of Madrid. Section 3 presents visually the identification strategy. Section 4 introduces our model control variables, involving data wrangling, imputation of missing values and descriptive analysis by means of visualizations. Section 5 contains a series of Difference-in-Differences (DiD) models used to answer our research question, which follow some pre-modelling feature engineering and assumption-checking plots. Finally, Sections 6 and 7 encompass robustness checks and the appendix, respectively.

We begin by loading the necessary libraries for our analysis. Some of them are used for data visualization and manipulation, such as tidyverse, ggplot2, ggdag, lubridate or forcats. Others are used for geospatial analysis, mapping and visualization, namely geosphere, sf, mapview, ggmap and maptools. Additionally, we need libraries for statistical analysis, in our case plm, lmtest, estimatr, fixest and mice. Lastly, we load libraries for formatting and output like stargazer, cowplot, corrplot, knitr and kableExtra. We use vistime()and highcharter() to display the policy timeline in the Appendix.

library(tidyverse)
library(ggplot2)
library(zoo)
library(ggdag)
library(lubridate)
library(forcats)
library(geosphere)
library(climaemet)
library(sf)
library(mapview)
library(ggmap)
library(maptools)
library(plm)
library(lmtest)
library(estimatr)
library(fixest)
library(mice)
library(miceadds)
library(cowplot)
library(corrplot)
library(stargazer)
library(knitr)
library(kableExtra)
library(vistime)
library(highcharter)

1. DATA PRE-PROCESSING

1.1. Pollution data for pollution monitoring stations inside the City of Madrid (2015 - 2022)

We download hourly pollution data from the data portal of the City Council for 24 different pollution monitoring stations in the City of Madrid for all months between January 2015 and December 2022 (data is available since 2001).

We first create a loop to read pollution data for all years and clean the data to get to a dataframe called ‘all_city_data_clean’. Important steps in this process include renaming and selecting relevant columns, transforming the data from wide to long format, filtering the dataframe for NO2 (nitrogen dioxide, measured in micrograms per cubic metre or µg/m3) measures and adjusting variable types. We then aggregate the data to the monthly level (‘city_monthly’).

Also, we are interested in checking whether there are any stations with fewer than 18 observations per day and 270 observations per year (around 75% of total observations). We find that only the pollution monitoring station in Arturo Soria (station code 16) has fewer than 270 observations per year (264). However, we can see that this is only the case for 2022 and we will not use observations from this year for reasons introduced later. Hence, we do not remove the data yet to ensure comparability with the Community dataset (Section 1.2).

# `For` loop for hourly data for all months in all years
file_names <- c(paste0(c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic"),
                       "_mo15.csv"),
                paste0(c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic"),
                       "_mo16.csv"),
                paste0(c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic"),
                       "_mo17.csv"),
                paste0(c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic"),
                       "_mo18.csv"),
                paste0(c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic"),
                       "_mo19.csv"),
                paste0(c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic"),
                       "_mo20.csv"),
                paste0(c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic"),
                       "_mo21.csv"),
                paste0(c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic"),
                       "_mo22.csv"))

# Create an empty dataframe to store the cleaned data
all_city_data_clean <- data.frame()

# Loop through the file names and apply the same transformations to each dataset
for (file_name in file_names) {
  
  # Read the data
  city_data <- read_delim(file_name, delim=";")
  
  # Apply transformations 
  city_data_clean <- city_data %>% 
    select(-c(PROVINCIA, PUNTO_MUESTREO)) %>% 
    rename(town_code = MUNICIPIO,
           station_code = ESTACION,
           magnitude = MAGNITUD,
           year = ANO,
           month = MES,
           day = DIA) %>% 
    select(-matches("^V")) %>%
    pivot_longer(cols = starts_with("H"), names_to = "hour", values_to = "pollution_level") %>%
    # Keep only NO2 measures and remove 'magnitude' column
    filter(magnitude == "8") %>% 
    select(-magnitude) %>%
    # Adjust variable types
    mutate(month = as.numeric(month), 
           day = as.numeric(day),
           hour = as.numeric(str_replace(hour, "H", "")),
           # Remove leading zeros from 'code' variables
           town_code = as.factor(str_remove(town_code, "^0+")),
           station_code = as.factor(str_remove(station_code, "^0+")),
           # Replace comma with point for 'pollution_level'
           pollution_level = as.numeric(gsub(",", ".", pollution_level)))
    
  # Append the cleaned data to the 'all_city_data_clean' dataframe
   all_city_data_clean <- bind_rows(all_city_data_clean, city_data_clean)
}

# Check for stations with less than 18 observations per day and 270 observations per year 
stations_to_exclude <- all_city_data_clean %>% 
  group_by(town_code, station_code, year, month, day) %>% 
  summarise(n_obs = n()) %>%
  group_by(town_code, station_code, year) %>% 
  summarise(n_obs_per_year = n(), 
            n_obs_per_day = mean(n_obs)) %>%
  filter(n_obs_per_day < 18 | n_obs_per_year < 270) 

print(stations_to_exclude)
## # A tibble: 1 × 5
## # Groups:   town_code, station_code [1]
##   town_code station_code  year n_obs_per_year n_obs_per_day
##   <fct>     <fct>        <dbl>          <int>         <dbl>
## 1 79        16            2022            264            24
# Aggregate the data to monthly level
city_monthly <- all_city_data_clean %>% 
  group_by(town_code, station_code, year, month) %>% 
  summarise(pollution_level = mean(pollution_level)) 

1.2. Pollution data for pollution monitoring stations outside the City of Madrid (2015 - 2022)

We employ a similar strategy involving loops to read pollution data from pollution monitoring stations outside the City of Madrid from 2015 to 2022. Hourly data for the 24 stations inside the Community of Madrid (excluding the Capital) since 2005 are available here. We apply virtually identical cleaning and aggregation steps to create our dataframe with monthly observations.

This time we find no stations with incomplete measurements when running the lines which create the object ‘stations_to_exclude’.

# Create file names vector
file_names <- c(paste0("com_", c(2015:2022), ".csv"))

# Create an empty dataframe to store the cleaned data
all_com_data_clean <- data.frame()

# Loop through the file names and apply the same transformations to each dataset
for (file_name in file_names) {
  
  # Read the data
  com_data <- read_delim(file_name, delim=";")
  
  # Apply transformations 
  com_data_clean <- com_data %>% 
    dplyr::select(-c(provincia, punto_muestreo)) %>% 
    rename(town_code = municipio,
           station_code = estacion,
           magnitude = magnitud,
           year = ano,
           month = mes,
           day = dia) %>% 
    select(-matches("^V")) %>%
    pivot_longer(cols = starts_with("h"), names_to = "hour", values_to = "pollution_level") %>%
    # Keep only NO2 measures and remove 'magnitude' column
    filter(magnitude == "8") %>% 
    select(-magnitude) %>%
    # Adjust variable types, remove leading zeros and replace points with commas 
    mutate(month = as.numeric(month),
           day = as.numeric(day),
           hour = as.numeric(str_replace(hour, "h", "")),
           town_code = as.factor(str_remove(town_code, "^0+")),
           station_code = as.factor(str_remove(station_code, "^0+")),
           pollution_level = as.numeric(gsub(",", ".", pollution_level))) 
   
  # Append the cleaned data to the 'all_com_data_clean' dataframe and drop NA
  all_com_data_clean <- bind_rows(all_com_data_clean, com_data_clean) %>% 
    drop_na(pollution_level) 
}

# Check for stations with less than 18 observations per day and 270 observations per year 
stations_to_exclude <- all_com_data_clean %>% 
  group_by(town_code, station_code, year, month, day) %>% 
  summarise(n_obs = n()) %>%
  group_by(station_code, year) %>% 
  summarise(n_obs_per_year = n(), 
            n_obs_per_day = mean(n_obs)) %>%
  filter(n_obs_per_day < 18 | n_obs_per_year < 270) %>%
  pull(station_code)

print(stations_to_exclude)
## factor()
## Levels: 1 14 2 3 4 5 7
# Aggregate the data to monthly level
com_monthly <- all_com_data_clean %>% 
  group_by(town_code, station_code, year, month) %>% 
  summarise(pollution_level = mean(pollution_level)) 

Complete dataset of monthly pollution observations for the air quality network of the Community of Madrid

We now combine the monthly pollution data from the 24 stations inside the City of Madrid and the 24 stations outside the City and create a dataframe called ‘full_data’ with the following columns: ‘town_code’ (see correspondence with municipality names in Sections 1.3 and 1.4), ‘station_code’, ‘year’, ‘month’ and ‘pollution_level’ (measured by NO2 concentration).

# Bind together observations of both datasets  
full_data <- bind_rows(city_monthly, com_monthly)

1.3. Data for pollution monitoring stations inside the City of Madrid

We now download data pertaining to the pollution monitoring stations inside the City of Madrid. The network is made up of 24 automatic stations that collect basic information for atmospheric surveillance.

We apply data cleaning tools similar to the ones before. Note that we create a column for the town code since this is a variable in the dataset with stations outside the City of Madrid (Section 1.4) and we want both datasets to have the same columns to combine them later.

We then extract the longitude and latitude of all stations in the network. We generate a variable called ‘distance_km’ representing the distance of each station to the station in Plaza del Carmen (only one inside the LEZ and our policy centroid, see Section 2.3). We compute these distances using the Haversine formula. We then arrange all stations by increasing distance from the city centre (Plaza del Carmen). We will use this information to create our treatment rings later on.

# Open dataset with station features
city_stations <- read_delim("informacion_estaciones_red_calidad_aire.csv", delim=";")

# Rename and select variables
city_stations_clean <- city_stations %>% 
  # Create column for town_code and transform to factor
   mutate(town_code = "079", 
          town_code = as.factor(str_remove(town_code, "^0+"))) %>% 
   select(
     town_code,
     station_code = CODIGO_CORTO,
     station_name = ESTACION,
     address = DIRECCION,
     station_type = NOM_TIPO,
     longitude = LONGITUD,
     latitude = LATITUD) %>% 
    mutate(station_code = as.character(station_code))

# Replace "á" with "a" 
city_stations_clean <- city_stations_clean %>% 
  mutate(station_type = gsub("á", "a", station_type),
         #If 'station_type' is "Suburbana", replace with "Suburbana (ciudad)" for comparison with out-of-City stations
         station_type = ifelse(station_type == "Suburbana", "Suburbana (ciudad)", station_type))

# Extract latitude and longitude of Plaza del Carmen 
plaza_carmen_longitude <- city_stations_clean$longitude[city_stations_clean$station_name == "Plaza del Carmen"]
plaza_carmen_latitude <- city_stations_clean$latitude[city_stations_clean$station_name == "Plaza del Carmen"]

# Calculate the distance between each station and Plaza del Carmen
city_stations_clean <- city_stations_clean %>%
  # Use `distHaversine()` function to calculate the distance between each station and Plaza del Carmen 
  mutate(distance_km = round(distHaversine(cbind(longitude, latitude), 
                                           c(plaza_carmen_longitude, plaza_carmen_latitude)) / 1000, 2)) %>% 
  # Arrange the data in ascending order by 'distance_km'
  arrange(distance_km)

# Display first few rows of the cleaned dataset
head(city_stations_clean)
## # A tibble: 6 × 8
##   town_code station_code station_name    address station_type longitude latitude
##   <fct>     <chr>        <chr>           <chr>   <chr>            <dbl>    <dbl>
## 1 79        35           Plaza del Carm… "Plaza… Urbana fondo     -3.70     40.4
## 2 79        4            Plaza de España "Plaza… Urbana traf…     -3.71     40.4
## 3 79        8            Escuelas Aguir… "Entre… Urbana traf…     -3.68     40.4
## 4 79        49           Parque del Ret… "Paseo… Urbana fondo     -3.68     40.4
## 5 79        48           Castellana      "C/ Jo… Urbana traf…     -3.69     40.4
## 6 79        47           Méndez Álvaro   "C/ Ju… Urbana fondo     -3.69     40.4
## # ℹ 1 more variable: distance_km <dbl>

For each station, we have information on their code (so that station and pollution data can be merged later on), address, location (i.e. longitude and latitude) and distance to the centre. We also see that pollution monitoring stations are of several types:

  • Urban background, which represent the exposure of the urban population to pollution in general;

  • Urban traffic, whose pollution measurements are mainly influenced by emissions from nearby streets or highways due to their strategic location;

  • Suburban, which generally have lower pollution levels (but higher ozone ones) because of their location on the outskirts of the City. We assign these stations the name “Suburbana (ciudad)” to create a distinction with suburban stations outside the City of Madrid.

1.4. Data for pollution monitoring stations outside the City of Madrid

Along the lines of the last section, we download and read data of pollution monitoring stations outside the Capital from the website of the Community of Madrid. We clean the data using regular expressions to fix the formatting of the longitude, latitude and address columns which had several issues (e.g. unnecessary quotation marks or commas). We also merge the columns containing information about area type and station type into a new column with the name of the latter to ensure comparability with data on stations inside the City of Madrid. We arrange the resulting dataframe so that stations closest to Plaza del Carmen are at the top.

# Open dataset with station features
com_stations <- read_delim("com_stations.csv", delim=";")

# Rename and select variables 
com_stations_clean <- com_stations %>% 
  # Extract the town and station codes from the full station code and adjust their types
   mutate(town_code = substr(estacion_codigo, 3, 5),
          town_code = as.factor(str_remove(town_code, "^0+")),
          station_code = substr(estacion_codigo, 6, 8),
          station_code = as.factor(str_remove(station_code, "^0+"))) %>% 
   select(town_name = estacion_municipio,
          town_code,
          station_code,
          station_type = estacion_tipo_estacion,
          area_type = estacion_tipo_area,
          rural_area_type = estacion_subarea_rural,
          address = estacion_direccion_postal,
          longitude = estacion_coord_longitud,
          latitude = estacion_coord_latitud) 

# Fix the latitude and longitude columns using regular expressions
com_stations_clean <- com_stations_clean %>%
  # Remove quotes at the beginning and end of the longitude column
  mutate(longitude = str_remove(longitude, "^\"|\",\\s*\"$")) %>%
  # Replace commas with periods in the longitude column and convert to numeric
  mutate(longitude = as.numeric(gsub(",", ".", longitude))) %>% 
  # Remove quotes and commas from the end of the latitude column
  mutate(latitude = str_remove(latitude, "\"?,?$")) %>%
  # Replace commas with periods in the latitude column and convert to numeric
  mutate(latitude = as.numeric(gsub(",", ".", latitude))) 

# Remove any sequence of two commas AND quotation marks from the address column
com_stations_clean <- com_stations_clean %>% 
  # Remove quotes from the address column
  mutate(address = str_remove(address, '["\']')) %>% 
  # Remove any sequence of two commas from the address column
  mutate(address = str_remove(address, ",{2}"))

# Create a new column that concatenates the 'area_type' and 'station_type' columns
com_stations_clean <- com_stations_clean %>% 
  # Combine the 'area_type' and 'station_type' columns with a space separator
  mutate(station_type_new = paste(area_type, station_type, sep = " "),
         # Use the `sapply()` function to convert the text in the 'station_type_new' column to title case
         station_type_new = sapply(strsplit(as.character(station_type_new), " "), 
                               function(x) paste0(x[1], " ", tolower(substring(x[2], 1, 1)), 
                                                  substring(x[2], 2), collapse = " "))) %>% 
  # Remove the 'area_type' and 'station_type' columns and rename the new column as 'station_type'
  select(-c(area_type, station_type)) %>% 
  rename(station_type = station_type_new)

# Calculate the distance between each station and Plaza del Carmen
com_stations_clean <- com_stations_clean %>%
  # Use `distHaversine()` function to calculate the distance between each station and Plaza del Carmen 
  mutate(distance_km = round(distHaversine(cbind(longitude, latitude), 
                                           c(plaza_carmen_longitude, plaza_carmen_latitude)) / 1000, 2)) %>% 
  # Arrange the data in ascending order by 'distance_km'
  arrange(distance_km)

# Display first few rows of the cleaned dataset
head(com_stations_clean)
## # A tibble: 6 × 9
##   town_name   town_code station_code rural_area_type address  longitude latitude
##   <chr>       <fct>     <fct>        <chr>           <chr>        <dbl>    <dbl>
## 1 Leganes     74        7            No aplica       C.E.P.A…     -3.75     40.3
## 2 Getafe      65        14           No aplica       C.E.I.P…     -3.72     40.3
## 3 Coslada     49        3            No aplica       Polidep…     -3.54     40.4
## 4 Alcorcon    7         4            No aplica       Colegio…     -3.83     40.3
## 5 Alcobendas  6         4            No aplica       Parque …     -3.65     40.5
## 6 Majadahonda 80        3            No aplica       Campo d…     -3.87     40.4
## # ℹ 2 more variables: station_type <chr>, distance_km <dbl>

We also have 24 stations. After our transformation, we see that there are six types of stations according to both the type of station and area: urban background, urban traffic, suburban background, suburban traffic, urban industrial and rural background.

Complete dataset of data for all stations in the air quality network of the Community of Madrid

After selecting the relevant variables, we combine the data of all 48 stations in the Community of Madrid into a dataframe called ‘data_all_stations’.

# Select key variables for in-City stations
city_selection <- city_stations_clean %>% 
  select(town_code, station_code, station_type, address, longitude, latitude, distance_km)

# Select key variables for out-of-City stations
com_selection <- com_stations_clean %>% 
  select(town_code, station_code, station_type, address, longitude, latitude, distance_km)

# Combine both datasets and arrange in ascending order of distance from Plaza del Carmen
data_all_stations <- bind_rows(city_selection, com_selection) %>% arrange(distance_km)

2. EXPLORATORY DATA ANALYSIS

After loading and formatting our pollution and pollution station data, we create a series of visualizations to understand it.

2.1. Comparison of average pollution levels across stations in the Community of Madrid (2022)

By way of introduction, we might be interested in how average pollution levels (NO2) compare across the different stations in the City of Madrid at a given point in time. We select year 2022 as it encompasses (so far) the most recent available data in our analysis. We assign labels with the names of the municipalities in Madrid to the different town codes. We then create our plot by grouping our data by town and obtaining the mean and standard error for the pollution level. We plot a point for each municipality with error bars indicating the standard error.

# Assign labels to levels of 'town_code'
my_levels <- c(5, 6, 7, 9, 13, 14, 16, 45, 47, 49, 58, 65, 67, 74, 79, 80, 92, 102, 120, 123, 133, 148, 161, 171, 180)

my_labels <- c("Alcala de Henares", "Alcobendas", "Alcorcon", "Algete", "Aranjuez", "Arganda del Rey", 
               "El Atazar", "Colmenar Viejo", "Collado Villalba", "Coslada", "Fuenlabrada", "Getafe", 
               "Guadalix de la Sierra", "Leganes", "Madrid", "Majadahonda", "Mostoles", 
               "Orusco de Tajuna", "Puerto de Cotos", "Rivas-Vaciamadrid", "San Martin de Valdeiglesias",
               "Torrejon de Ardoz", "Valdemoro", "Villa del Prado", "Villarejo de Salvanes")

# Select necessary variables for the plot
full_data_plot <- full_data %>%
  mutate(town_code_label = factor(town_code, levels = my_levels, labels = my_labels))

# Confidence interval plot for all stations in the Community of Madrid
full_data_plot %>%
  filter(year == 2022) %>%
  group_by(town_code_label) %>%
  summarise(pollution_level_mean = mean(pollution_level),
            pollution_level_se = sd(pollution_level)/sqrt(n())) %>%
  ggplot(aes(reorder(town_code_label, pollution_level_mean), pollution_level_mean)) +
  geom_point() +
  geom_errorbar(aes(ymin = pollution_level_mean - 1.96 * pollution_level_se,
                    ymax = pollution_level_mean + 1.96 * pollution_level_se),
                width = 0.4) +
  labs(x = "Municipality Monitoring Station",
       y = expression("Average Monthly NO"[2] ~ "(95% Confidence Interval)"),
       title = "Air Quality by Municipality in the Community of Madrid (2022)") +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 20)) +  
  ylim(0, 40) +
  theme_light() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
        axis.title.x = element_text(margin = margin(t = 10), size = 10),
        axis.title.y = element_text(size = 10),
        plot.margin = margin(b = 20)) 

There are significant differences between the pollution levels in large municipalities close to Madrid (like Leganés or Getafe) and those in smaller, rural areas (e.g. Puerto de Cotos). We will see later that this divergence of rural pollution stations makes them inadequate as control units in our analysis.

2.2. Comparison of average yearly pollution levels and compliance with Directive 2008/50/EC between in-City and out-of-City stations

From the previous chart, we hypothesize that there may be an important difference between pollution levels in the City of Madrid (due to larger traffic flows) and those outside (despite the heterogeneity among the latter as seen in the previous plot). Hence, we will now compare both average pollution levels and the number of days with excessive air pollution levels by year and location.

Average pollution (NO2) by year and location

To check how average pollution levels have evolved between 2015 and 2022 inside vis-à-vis outside the City of Madrid, we aggregate both sets of data (in-City vs. out-of-City) by year and compute the mean. We then combine the data and pivot it to wide format.

# Aggregate hourly data from inside the City of Madrid to yearly level and create 'Location' column 
city_yearly <- all_city_data_clean %>% 
  group_by(year) %>%
  summarise(avg_pollution_level = round(mean(pollution_level), 2),
            # create 'Location' column to identify values for inside the City
            Location = "Inside City of Madrid") 
  
# Aggregate hourly data from outside the City of Madrid to yearly level and create 'Location' column
com_yearly <- all_com_data_clean %>% 
  group_by(year) %>% 
  summarise(avg_pollution_level = round(mean(pollution_level), 2),
            # create 'Location' column to identify values for outside the City
            Location = "Outside City of Madrid") 

# Combine city and out-of-City data
summary_table_1 <- bind_rows(city_yearly, com_yearly)

# Convert the summary table to wide format (years as columns and areas as rows)
summary_table_wide_1 <- pivot_wider(summary_table_1, names_from = "year", 
                                    values_from = "avg_pollution_level", 
                                    values_fn = mean)

# Format and print the summary table
kable(summary_table_wide_1, format = "html", row.names = FALSE) %>%
  kable_styling(c("striped", "bordered"), full_width = FALSE) 
Location 2015 2016 2017 2018 2019 2020 2021 2022
Inside City of Madrid 40.90 38.51 41.54 36.54 34.50 27.21 28.94 28.22
Outside City of Madrid 25.73 23.79 26.44 22.10 21.12 16.97 17.50 17.45

We see that year 2017 saw the highest average pollution levels both inside and outside the City of Madrid. In fact, in 2015 and 2017 the Capital was surpassing the limit of 40 µg/m3 for yearly average NO2 concentrations set by the Ambient Air Quality Directive. Still, throughout all years under consideration, NO2 levels in the air are considerably higher in the Capital than in the rest of the Community.

Number of times with average pollution levels above Directive limits by year and location

The limit of 40 µg/m3 is set on a yearly basis, which we just saw was violated for 2 years in the Capital. However, we might also want to dive a bit deeper and check whether the hourly limit of 200 µg/m3 (which shall not to be exceeded more than 18 times in a calendar year according to the Directive) was complied with.

We thus calculate the number of times (i.e. hours) the pollution level exceeded 200 each day for each location. We then combine the data from both datasets and create a summary table that shows the number of times the pollution level exceeded 200 µg/m3 for each year and location. That is, we take the sum of all days for which the pollution level exceeded 200 µg/m3 at least once for each group. We then pivot and display the data below.

# Create daily summaries of pollution level for stations both inside and outside the city 
city_daily <- all_city_data_clean %>% 
  group_by(year, month, day) %>%
  summarise(pollution_count = sum(pollution_level > 200))

com_daily <- all_com_data_clean %>% 
  group_by(year, month, day) %>% 
  summarise(pollution_count = sum(pollution_level > 200))

# Combine city and out-of-City data
all_daily <- bind_rows(
  city = city_daily %>% mutate(Location = "Inside the City"),
  com = com_daily %>% mutate(Location = "Outside the City"))

# Count number of times average hourly pollution levels exceeded 200 µg/m3 for each year and area
summary_table_2 <- all_daily %>%
  group_by(Location, year) %>%
  summarise(num_times = sum(pollution_count > 0))

# Convert the summary table to wide format
summary_table_wide_2 <- summary_table_2 %>%
  pivot_wider(names_from = "year", values_from = "num_times") 

# Format and print the summary table
kable(summary_table_wide_2, format = "html", row.names = FALSE) %>%
  kable_styling(c("striped", "bordered"), full_width = FALSE)
Location 2015 2016 2017 2018 2019 2020 2021 2022
Inside the City 61 47 63 38 43 13 15 8
Outside the City 28 5 12 0 1 0 0 0

Only from 2020 the number of times per year where hourly pollution levels exceeded 200 µg/m3 in Madrid was less than 18. Outside the city of Madrid only 2015 witnessed a violation of the rule for hourly NO2 concentrations as prescribed by the Directive.

2.3. Visualization of the location of pollution monitoring stations

Perimeter of the Madrid Central Low Emission Zone (LEZ) and pollution monitoring stations in the City of Madrid

Clearly, the LEZ in Madrid is meant to target the comparatively higher levels of pollution in the Capital. Hence, we would like to map the distribution of pollution stations in Madrid in relation to the LEZ perimeter (and the broader area of the City).

We convert our dataframe containing information on pollution stations to a spatial object using the longitude and latitude columns to define the coordinates and specifying the coordinate reference system (CRS) as EPSG:4326 (which represents WGS 84, a commonly used global CRS for geographic data).

We then load the LEZ and administrative boundary data we obtained from the geoportal of the Madrid City Council and transform the former to the same CRS as the station data. We do this to ensure both objects align for spatial analysis. We then spatially join the station and LEZ data, resulting in an object called ‘stations_in_LEZ’ that contains information about which stations are located within the restricted area. We determine whether a station is located inside or outside the Madrid Central (MC) area by its exact coordinates. Hence, we create a binary variable called ‘LEZ’ that is set to 1 for stations within the LEZ and 0 for stations outside.

First we display the resulting stations and the LEZ perimeter using the mapview() function. We then use the same function to create a second map that also incorporates the administrative boundaries of the City of Madrid for the reader to compare the relative size of the restricted area.

# Convert station data to spatial object
city_stations_coord.spdf <- st_as_sf(city_stations_clean, 
                                     coords = c("longitude", "latitude"),
                                     crs = 4326)

# Load and transform LEZ data to match station coordinates (same CRS)
LEZ_limits <- st_read("ZBEDEP_Distrito_Centro_Limite.shp")
## Reading layer `ZBEDEP_Distrito_Centro_Limite' from data source 
##   `C:\Users\elena\OneDrive\Desktop\MUCSS\V. TFM\master-thesis-elena-yustres\ZBEDEP_Distrito_Centro_Limite.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 439078.8 ymin: 4472973 xmax: 441372.8 ymax: 4475758
## Projected CRS: ETRS89 / UTM zone 30N
LEZ_limits_t <- st_transform(LEZ_limits, st_crs(city_stations_coord.spdf))

# Spatially join stations and LEZ data
stations_in_LEZ <- st_join(city_stations_coord.spdf, LEZ_limits_t, join = st_within)

# Create binary variable to denote if station is within the LEZ 
stations_in_LEZ$LEZ <- 
  ifelse(is.na(stations_in_LEZ$AMBITO_GEO), 0, 1)

# Filter for only station inside LEZ
stations_in_LEZ %>% filter(LEZ == 1)
## Simple feature collection with 1 feature and 8 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -3.703166 ymin: 40.41921 xmax: -3.703166 ymax: 40.41921
## Geodetic CRS:  WGS 84
## # A tibble: 1 × 9
##   town_code station_code station_name     address       station_type distance_km
## * <fct>     <chr>        <chr>            <chr>         <chr>              <dbl>
## 1 79        35           Plaza del Carmen Plaza del Ca… Urbana fondo           0
## # ℹ 3 more variables: geometry <POINT [°]>, AMBITO_GEO <chr>, LEZ <dbl>
# Display the stations on the map with the LEZ perimeter
mapview(stations_in_LEZ, label = "CENTRO", col.regions = "orange") + mapview(LEZ_limits_t, alpha.regions = 0.2)
# Load object with administrative boundaries of Madrid
madrid <- st_read("madrid_.geojson")
## Reading layer `madrid_' from data source 
##   `C:\Users\elena\OneDrive\Desktop\MUCSS\V. TFM\master-thesis-elena-yustres\madrid_.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 128 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -3.887647 ymin: 40.31325 xmax: -3.516929 ymax: 40.64444
## Geodetic CRS:  WGS 84
# Display the stations on the map with LEZ perimeter and Madrid administrative boundaries
mapview(stations_in_LEZ, label = "CENTRO", col.regions = "orange") + mapview(madrid, alpha.regions = 0.1) + mapview(LEZ_limits_t, alpha.regions = 0.5)

The reader is encouraged to hover over both maps and chick on any given station to find information about its name, type and address, among others.

From the dataframe ‘stations_in_LEZ’ and both plots we can gather that only the station in Plaza del Carmen is inside the LEZ. This implies we can treat it as our “policy centroid”. Also, the LEZ is relatively small compared to the area of the entire municipality.

Specification of rings based on distance from Plaza del Carmen

With the goal of quantifying the spillover effects of the policy beyond the borders of the LEZ, we define a set of rings which segment all 48 stations in the Community of Madrid network according to their distance from the Madrid city centre.

We define arbitrary cutoff points at 2.5, 8, 15, 20 and 40 kilometres from Plaza del Carmen, thereby yielding six different rings or categories of stations. The six rings have 4, 12, 14, 4, 6 and 8 stations, respectively.

# Specify cutoffs for 'distance_km' and assign ring labels
data_all_stations$ring <- cut(data_all_stations$distance_km, breaks = c(-Inf, 2.5, 8, 15, 20, 40, Inf),
                     labels = c("Ring 1: Treated", 
                                "Ring 2: Treated", 
                                "Ring 3: Treated", 
                                "Ring 4: Treated", 
                                "Ring 5: Control", 
                                "Ring 6: Excluded"))

# Display number of stations in each ring
data_all_stations %>% group_by(ring) %>% count()
## # A tibble: 6 × 2
## # Groups:   ring [6]
##   ring                 n
##   <fct>            <int>
## 1 Ring 1: Treated      4
## 2 Ring 2: Treated     12
## 3 Ring 3: Treated     14
## 4 Ring 4: Treated      4
## 5 Ring 5: Control      6
## 6 Ring 6: Excluded     8

Next, we set a bounding box that encompasses the Community of Madrid by defining its minimum and maximum longitude and latitude coordinates. We then retrieve the map from the Stamen Maps API using get_stamenmap(). The map is plotted using ggmap(), which allows us to add the map tiles to a ‘ggplot2’ object. The ‘extent’ argument is set to “normal” to ensure that the aspect ratio of the map is preserved and ‘maprange’ is set to “FALSE” to prevent ggmap() from adjusting the axes of the plot. Each station is a point on the map and its corresponding ring is coded through the color aesthetic.

# Set bounding box coordinates
bbox_madrid <- c(-4.647491, 39.804255 , -2.942500, 41.221046)

# Retrieve map using `get_stamenmap()`
madrid_map <- get_stamenmap(bbox = bbox_madrid, zoom = 10, maptype = "toner-lite")

# Define colors for each ring
ring_colors <- c("#23541e", "#398f33", "#70cc3e", "#a9ff4d", "#1AA7EC", "#b30000")

# Create plot
ggmap(madrid_map, extent = "normal", maprange = FALSE, alpha = 0.3) +
  geom_point(data = data_all_stations,
             aes(x = longitude, y = latitude, color = ring),
             size = 1.5) +
  scale_color_manual(values = setNames(ring_colors, levels(as.factor(data_all_stations$ring)))) +
  labs(title = "Location of Pollution Monitoring Stations in the Community of Madrid: Specification by Rings",
       color = "Ring") +
  theme_void() +
  theme(legend.position = "bottom",
        legend.background = element_rect(colour = "black", size = 0.3),
        legend.key.size = unit(0.5, "cm"),
        legend.text = element_text(size = 7),
        legend.title = element_text(size = 8),
        legend.box.spacing = unit(0.1, "cm"),
        legend.box.margin = margin(0, 0, 0, 0),
        legend.box.background = element_rect(color = "black", linewidth = 0.2),
        plot.title = element_text(size = 12, hjust = 0.5))

2.4. Monthly average pollution by station type and location (2015 - 2022)

To close off the descriptive analysis on pollution-related data, we want to know if there are any differences in the evolution of NO2 pollution levels (2015 - 2022) between urban and suburban stations (see description in Section 1.3).

We start by merging monthly pollution data for the city of Madrid with information about those same stations on the one hand, and pollution data for stations outside city of Madrid with information about those other stations on the other. We add a variable called ‘type’ to both datasets to indicate if a given station is urban or suburban.

Importantly, we remove rural (background) stations from the dataset with out-of-Capital stations because, as we saw before, these stations are intrinsically different from those in urban settings even outside the city. They are located in places with very little traffic and thus, when put together with other stations outside the City of Madrid, differences in pollution levels with City stations are falsely amplified.

We combine our dataframes and replace NAs in the ‘town_name’ column by ‘Madrid’ since the ‘city_stations_clean’ dataframe lacks this column in contrast with ‘com_stations_clean’. We only want observations for January and July for each year since we are also interested in analyzing whether there are significant differences between different months in the year.

Finally, we create the column ‘type_color’ to categorize our observations into four different labels based on the station type and location (in- vs. out-of-City). We calculate monthly averages for each category by grouping the data by ‘year’, ‘month’, ‘type’ and ‘type_color’ and plot them using the color and line type aesthetics.

# Join 'city_monthly' with 'city_stations' by 'town_code' and 'station_code'
city_data <- city_monthly %>%
  left_join(city_stations_clean, by = c("town_code", "station_code"))

# Create a new column ('type') to indicate if the station is urban or suburban (in-City data)
city_data$type <- if_else(city_data$station_type == "Urbana trafico" | city_data$station_type == "Urbana fondo", 
                          "Urban", "Suburban")

# Join 'com_monthly' with 'com_stations' by 'town_code' and 'station_code' 
com_data <- com_monthly %>%
  left_join(com_stations_clean, by = c("town_code", "station_code")) %>% 
  # Remove 'rural stations'
  filter(station_type != "Rural fondo")

# Create a new column ('type') to indicate if the station is urban or suburban (out-of-City data)
com_data$type <- ifelse(grepl("^Urbana", com_data$station_type), "Urban", "Suburban")

# Combine both dataframes into a single one for the whole Community
full_data_complete <- bind_rows(city_data, com_data) %>% 
  select(-c(station_name, rural_area_type))

# Replace the missing values in 'town_name' by Madrid
full_data_complete$town_name <- ifelse(is.na(full_data_complete$town_name), "Madrid", full_data_complete$town_name) 
# Filter to include only January and July of each year 
full_data_plot <- full_data_complete %>%
  filter(month %in% c(1, 7)) %>% 
  # Create column with 4 different labels with combinations of type and location 
  mutate(type_color = if_else(town_code == 79 & type == "Urban", "City_Urban",
                    if_else(town_code == 79 & type == "Suburban", "City_Suburban",
                    if_else(town_code != 79 & type == "Urban", "NonCity_Urban", "NonCity_Suburban")))) %>% 
  # Obtain monthly averages for each category
  group_by(year, month, type, type_color) %>% 
  summarise(pollution_level = mean(pollution_level)) 
  
# Plot the data
full_data_plot %>% 
  ggplot(aes(x = as.Date(paste(year, month, "01", sep = "-")), 
             y = pollution_level, 
             color = ifelse(grepl("_Urban", type_color), "Urban Station", "Suburban Station"),
             linetype = ifelse(grepl("^City", type_color), "Inside City", "Outside City"))) +
  geom_line(linewidth = 0.4) +
  labs(x = "Year", 
       y = expression("Monthly NO"[2] ~ "Pollution Averages"), 
       linetype = "Station Location", 
       color = "Station Type", 
       title = expression("Evolution of NO"[2] ~ "Levels in the Community of Madrid by Location and Station Type (2015 - 2022)")) +
  scale_linetype_manual(values = c("solid", "dashed")) +
  scale_color_manual(values = c("darkred", "darkblue")) +
  theme_light() +
  theme(plot.title = element_text(size = 11))

First, there does seem to be a clear difference between urban and suburban stations, both when considering stations inside and outside the City of Madrid. Urban stations (blue) appear to always have higher average pollution levels than suburban ones (red), regardless of whether the station is inside or outside the Capital.

Second, we see that average pollution is higher in January than in July for every year, type of station and location. This could be due to increased traffic during holiday season compared to the summer months, the increased use of heating due to lower temperatures or meteorological conditions themselves (such as low wind speeds). These differences highlight the need for incorporating weather-related controls (see Section 4 below) and month fixed effects (see Section 5) into our model.

3. IDENTIFICATION: THE CAUSAL MECHANISM OF MADRID CENTRAL AND POLLUTION IN NEIGHBOURING AREAS

Let’s take a step back. We are trying to assess how the Madrid LEZ affects the pollution levels outside the perimeter of the area under restrictions. We make use of the following Directed Acyclic Graph (DAG) to visually display our identification strategy aimed at isolating and quantifying this effect (Pearl 2009).

We use the dagify() function from the ggdag package. Each node represents a variable in the causal model and the edges represent the causal relationships between them. We specify the treatment as the Madrid LEZ and the outcome variable as pollution levels outside the LEZ. We then define three possible mechanisms or ways by which our treatment may affect our outcome, giving rise to the following mediators:

  1. Redirection of traffic from areas subject to restrictions to areas without restrictions, thereby increasing pollution outside the LEZ

  2. Increased use of public transport to avoid restrictions on private vehicles, thereby decreasing pollution outside the LEZ (as well as inside)

  3. Replacement of highly polluting vehicles with “greener” ones by drivers to avoid fines, thereby decreasing pollution outside the LEZ (as well as inside)

Finally, we theorize that there may be some potential confounders which could bias our estimate by causing a change in pollution levels not attributable to the LEZ. First, macroeconomic conditions may affect the use of private vehicles (and hence of public transportation by the substitution effect) and the willingness or ability to purchase new ones, thus affecting mediators (2) and (3). Second, large public works in the city centre could have an important effect on channels (1) and (2) by “forcing” people to take alternative routes to avoid traffic jams or by encouraging them to use the subway, respectively (see Section 6.1.F). Third and last, meteorological conditions may affect NO2 concentrations independently from the existence of restrictions, thus posing a threat to identification.

We will see that the impact of both public works and economic indicators is accounted for by time and unit fixed effects in our model. In contrast, weather effects may escape the control of the intersection of these two kinds of fixed effects and we thus need to incorporate them explicitly into our model.

We manually set the color of each node according to its type (outcome, treatment, mediator or confounder) using the scale_color_manual() function.

# Set the theme of the plot to a DAG theme
theme_set(theme_dag())

# Define the coordinates of the nodes in the DAG
coord_dag <- list(
  x = c("fleet_recomposition" = 1.5, "public_transport" = 1.5, "redirected_traffic" = 1.5, 
        "MC" = 0, "pollution_outside" = 3,
        "weather" = 3.5, "economic" = 2, "public_works" = 2),
  y = c("fleet_recomposition" = 1, "public_transport" = 2, "redirected_traffic" = 3, 
        "MC" = 2, "pollution_outside" = 2,
        "weather" = 0.5, "economic" = 0.5, "public_works" = 4))

# Define the DAG structure and label the nodes
pollution_dag <- dagify(pollution_outside ~ fleet_recomposition + public_transport + redirected_traffic,
                         pollution_outside ~ weather,
                         redirected_traffic ~ MC,
                         redirected_traffic ~ public_works,
                         fleet_recomposition ~ MC,
                         public_transport ~ public_works,
                         fleet_recomposition ~ economic,
                         public_transport ~ MC,
                         public_transport ~ economic,
                         labels = c(
                           "pollution_outside" = "POLLUTION\n OUTSIDE LEZ",
                           "redirected_traffic" = "(1) Redirection\n of traffic",
                           "public_transport" = "(2) Higher relative use\n of public transport",
                           "fleet_recomposition" = "(3) Fleet\n recomposition",
                           "MC" = "LEZ",
                           "weather" = "Weather",
                           "public_works" = "Public works",
                           "economic" = "Economic\n conditions"
                           ),
                         coords = coord_dag, 
                         outcome = "pollution_outside")

# Plot the DAG without text labels and with color-coding based on node type
p <- ggdag(pollution_dag, text = FALSE, use_labels = "label") 

p$layers[[3]]$mapping <- aes(colour = ifelse(
  name == "pollution_outside", "Outcome",
  ifelse(
    name == "MC", "Treatment",
    ifelse(
      name %in% c("fleet_recomposition", "public_transport", "redirected_traffic"),
      "Mediators",
      "Confounders"))))

# Add color legend
p + scale_color_manual(values = c("gray", "skyblue", "maroon", "darkgreen")) +
  # Remove legend and adjust plot margins
  theme(legend.position = "none") + 
  # Add plot title
  labs(title = "DAG: Causal Effect of Madrid LEZ on Neighbouring Areas") 

4. MODEL COVARIATES: WEATHER CONTROLS

Taking up the last point in the previous section, we want to include weather data to control for time-variant station heterogeneity (more on this in the main document).

We access weather data provided by the Spanish Meteorological Agency (Agencia Estatal de Meteorología, AEMET) through the AEMETOpenData REST API. Note that, in general, REST APIs use HTTP requests to communicate data and functionality between client and server applications. Our API specifically grants us access to data on ozone contents, satellite information, climate predictions or daily as well as monthly observations for a series of meteorological variables.

We use the function aemet_api_key() to set the key for the AEMET API, which was obtained per request and generated from the user’s email address.

# Access AEMET API
aemet_api_key("eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJlbGVuYS55dXN0cmVzQGdtYWlsLmNvbSIsImp0aSI6Ijg5MDM3MDE1LTEzMzMtNDkwZS1iZWRjLTM3NTk0ZmVkYTQ0YiIsImlzcyI6IkFFTUVUIiwiaWF0IjoxNjc3NDQ2ODA2LCJ1c2VySWQiOiI4OTAzNzAxNS0xMzMzLTQ5MGUtYmVkYy0zNzU5NGZlZGE0NGIiLCJyb2xlIjoiIn0.9N_v-cQ513FFOlcIwwXIVxXoQSYmQXCeB1yVCfJchXs", overwrite = TRUE)

4.1. Data on weather stations in the Community of Madrid

Before working with data on the meteorological variables themselves, we access data on the 13 weather stations scattered around the Community of Madrid.

After cleaning our data, we want to match each pollution monitoring station with its closest weather station. To that end, we extract a matrix of coordinates for all air pollution monitoring stations and weather stations in the Community from the ‘data_all_stations’ and the newly created ‘weather_stations’ dataset, respectively. We then calculate the distances between pollution and weather stations using (again) the Haversine formula and match each air pollution station to the closest weather station. We add this information as a new column to the ‘data_all_stations’ dataframe we created earlier.

# Access data on weather stations and make transformations
weather_stations <- aemet_stations(verbose = FALSE, return_sf = FALSE) %>% 
  filter(provincia == "MADRID") %>% 
  select(-c(indicativo, indsinop, provincia)) %>% 
  rename(altitude = altitud,
         longitude = longitud,
         latitude = latitud,
         weather_station_name = nombre)

# Create a matrix of coordinates for air pollution stations
coords_pollution <- data_all_stations[, c("longitude", "latitude")]

# Create a matrix of coordinates for weather stations
coords_weather <- weather_stations[, c("longitude", "latitude")]

# Calculate distances between each air pollution station and each weather station
distances <- distm(coords_pollution, coords_weather, fun = distHaversine)

# Find the index of the closest weather station for each air pollution station
closest_weather_index <- apply(distances, 1, which.min)

# Match each air pollution station to the closest weather station
data_all_stations$closest_weather_station <- weather_stations$weather_station_name[closest_weather_index]

4.2. Data on weather variables (2015 - 2021)

While we obtained data for 2022 to show the evolution of pollution levels using the most recent observations, pollution data after June 2021 will not be used in our model for reasons outlined later and in the main document. Hence, we will only access weather data from 2015 until the end of 2021 to keep all observations for all seven years for visualization purposes.

We now use aemet_daily_clim() to obtain daily data for all stations in Madrid. We clean the data by selecting the relevant meteorological variables as found in the literature (see Section 4 in the main document): average, minimum and maximum temperature (we will later select the most relevant out of these three), total rain, direction of maximum wind speed, average wind speed and minimum and maximum pressure (we will later compute the average to avoid collinearity). We then adjust variable types and acknowledge as missing values all observations containing the value ‘88’ for the wind direction variable following the information available in the metadata.

# Obtain daily weather values for all stations in Madrid
weather_full <- aemet_daily_clim(station = "all", start = "2015-01-01", end = "2021-12-31", 
                                 verbose = FALSE, return_sf = FALSE) %>% 
  filter(provincia == "MADRID") 

# Make transformations: select, adjust variable types and create new columns for the date
weather_full_final <- weather_full %>% 
  select(date = fecha,
         weather_station_name = nombre,
         avg_temperature = tmed,
         min_temperature = tmin, 
         max_temperature = tmax,
         total_rain = prec,
         direction_max_wind_speed = dir,
         avg_wind_speed = velmedia,
         min_pressure = presMin,
         max_pressure = presMax) %>% 
  # Replace comma with point
  mutate(total_rain = as.numeric(gsub(",", ".", total_rain)),
         direction_max_wind_speed = as.numeric(direction_max_wind_speed),
         # Obtain 'year' and 'month' columns from 'date' column
         date = ymd(date),
         year = year(date),
         month = month(date),
         day = day(date)) %>% 
  select(-date) 

# Replace all values of "88" in 'wind direction' with NAs 
weather_full_final$direction_max_wind_speed <- ifelse(weather_full_final$direction_max_wind_speed == "88", NA,
                                                      weather_full_final$direction_max_wind_speed)

4.3. Imputation of missing values in weather variables

At first glance, we notice that we face the problem of missing values in our weather variables. We will use multiple imputation to tackle this issue. This method uses the distribution of the observed data to estimate multiple plausible values that reflect the uncertainty about the missing observations (Rubin 1976). While it generally performs better than zero- or mean-/median- imputation and is most likely superior to dropping observations altogether, multiple imputation relies on a few assumptions we must check first.

Let’s start by checking the percentage of missing values in each of our potential controls:

# Create a vector of column names to exclude
exclude_cols <- c("year", "month", "day", "weather_station_name")

# Calculate the percentage of missing values for each numeric variable in the dataframe
NA_percentage <- sapply(weather_full_final[ , !(colnames(weather_full_final) %in% exclude_cols)], 
                      function(x) mean(is.na(x))) * 100

# Display the result
print(NA_percentage)
##          avg_temperature          min_temperature          max_temperature 
##                 4.382683                 4.382683                 4.376395 
##               total_rain direction_max_wind_speed           avg_wind_speed 
##                 5.854057                12.453234                10.491401 
##             min_pressure             max_pressure 
##                20.800453                20.803597

It seems that the temperature and rain variables do not pose a big problem since they only have around 5% of missing values. Wind-related variables have around 10% missing values, which is worse but still within a “normal range.” The issue is not as clear with the pressure-related variables, for which around a fifth of observations are missing.

Hence, we do some checks taking minimum pressure as our example precisely because of its relatively higher number of missing values (together with maximum pressure). Before that, however, we print its five-number summary in order to later compare these values in the variable before and after imputation.

# Print 5-number summary for 'min_pressure'
summary(weather_full_final$min_pressure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   780.4   890.0   931.0   908.1   942.9   971.9    6616

From a theoretical perspective, multiple imputation relies on the assumption of Missing at Random (MAR). This states that the probability of a data point being missing depends only on the observed data and not on the missing values themselves. While this assumption cannot be tested statistically, we can check whether missingness is associated with the station by plotting the frequency of NAs in minimum pressure by the weather station:

# Compute number of NAs for 'min_pressure' in each station 
na_counts <- weather_full_final %>% 
  group_by(weather_station_name) %>% 
  summarise(na_count = sum(is.na(min_pressure)))

# Plot histogram of missing values per station
ggplot(na_counts, aes(x = reorder(weather_station_name, -na_count), y = na_count)) +
  geom_bar(stat = "identity", fill = "lightyellow", color = "black") +
  labs(x = "Weather Station",
       y = "Number of Missing Values",
       title = "Missing Values in Minimum Pressure by Station") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 70, hjust=1, size = 7))

We see that the stations in Aranjuez, Ciudad Universitaria or Somosierra have a considerably high number of missing values compared to other stations. This suggests that the missingness of ‘min_pressure’ (and for the other variables for which we repeated this process) might not be missing at random.

Nevertheless, there are two main reasons why we will proceed with the multiple imputation of our weather covariates.

First, since we include station fixed effects in our models (Section 5), we can think of missingness as an additional attribute that varies across stations besides, for instance, the amount of traffic. Hence, assuming missingness is captured by these unit fixed effects, we want to ensure that we do not observe more missing values at specific dates (which would be problematic). To check this, we create a similar plot but this time we group by date rather than station:

# Compute number of NAs for 'min_pressure' for each date 
na_counts <- weather_full_final %>% 
  group_by(year, month) %>% 
  summarise(na_count = sum(is.na(min_pressure))) %>% 
  mutate(date = as.yearmon(paste(year, month, sep = "-"), format = "%Y-%m"))

# Define the levels of the date column in chronological order
na_counts$date <- factor(na_counts$date, levels = unique(na_counts$date))

# Filter the data to select a subset of labels
selected_labels <- na_counts$date[seq(1, nrow(na_counts), length.out = 20)]

# Plot histogram of missing values per date
ggplot(na_counts, aes(x = date, y = na_count)) +
  geom_bar(stat = "identity", fill = "lightyellow", color = "black") +
  labs(x = "Date", 
       y = "Number of Missing Values",
       title = "Missing Values in Minimum Pressure by Date") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 70, hjust = 1, size = 7)) +
  scale_x_discrete(breaks = selected_labels)

While March of 2018 seems to have an unusually high number of missing values, there does not seem to be any pattern indicating missingness is strongly correlated with time. We hence have a more or less balanced panel.

Second, we see that the imputed distributions of ‘min_pressure’ are rather similar to the original one in the plots below. In order to visualize how the imputed values are distributed for this specific variable, we use the package mice() (standing for ‘Multivariate Imputation by Chained Equations’) with three different algorithms: Predictive Mean Matching (“pmm”), Lasso Regression (“lasso.norm”) and Classification and Regression Trees (“cart”).

We use ten imputations for each missing value (m=10) and set a random seed to ensure the replicability of our imputation results.

set.seed(1234)

# Create dataframe with original and imputed values using different algorithms
mice_imputed <- data.frame(
  original = weather_full_final$min_pressure,
  imputed_pmm = complete(mice(weather_full_final, m=10, method = "pmm"))$min_pressure,
  imputed_cart = complete(mice(weather_full_final, m=10, method = "cart"))$min_pressure,
  imputed_lasso = complete(mice(weather_full_final, m=10, method = "lasso.norm"))$min_pressure)

We plot the resulting distributions below:

# Create histograms for the distribution of each algorithm and compare with original one
MICE_plot1 <- ggplot(mice_imputed, aes(x = original)) +
  geom_histogram(binwidth = 4, color = "black", fill = "lightblue") +
  labs(x= "Values", 
       y = "Frequency", 
       title = "Original Distribution of Minimum Pressure") +
  theme_light() +
  theme(plot.title = element_text(size = 9)) 

MICE_plot2 <- ggplot(mice_imputed, aes(x = imputed_pmm)) +
  geom_histogram(binwidth = 4, color = "black", fill = "lightpink", position = "identity") +
  labs(x= "Values", 
       y = "Frequency", 
       title = "PMM-Imputed Distribution of Minimum Pressure") +
  theme_light() +
  theme(plot.title = element_text(size = 9)) 

MICE_plot3 <- ggplot(mice_imputed, aes(x = imputed_cart)) +
  geom_histogram(binwidth = 4, color = "black", fill = "lightgreen", position = "identity") +
  labs(x= "Values", 
       y = "Frequency", 
       title = "Cart-Imputed Distribution of Minimum Pressure") +
  theme_light() +
  theme(plot.title = element_text(size = 9)) 

MICE_plot4 <- ggplot(mice_imputed, aes(x = imputed_lasso)) +
  geom_histogram(binwidth = 4, color = "black", fill = "lightsalmon", position = "identity") +
  labs(x= "Values", 
       y = "Frequency", 
       title = "Lasso-Imputed Distribution of Minimum Pressure") +
  theme_light() +
  theme(plot.title = element_text(size = 9)) 

plot_grid(MICE_plot1, MICE_plot2, MICE_plot3, MICE_plot4, nrow = 2, ncol = 2)

Although all algorithms yield approximately equivalent distributions, PMM seems to result in the closest to the original distribution. Due to this and the relatively lower required computational time, we choose PMM to impute all our weather variables.

Note that PMM is used as part of a chained equation approach, whereby in each iteration each variable with missing values is imputed by finding the nearest observed value to the missing value. It relies on the use of a regression model that includes all other variables in the dataset (both observed and imputed).

set.seed(1234)

# Specify columns to impute
cols_to_impute <- c("avg_temperature", "max_temperature", "min_temperature", 
                    "total_rain",  "direction_max_wind_speed", "avg_wind_speed", 
                    "min_pressure", "max_pressure")

# Impute missing values in the columns using PMM
imputed_data <- mice(weather_full_final[cols_to_impute], m = 10, method = "pmm")

# Update original dataframe with imputed values
weather_full_final[cols_to_impute] <- complete(imputed_data)[, cols_to_impute]

To account for information beyond the visual inspection of the imputed distributions above, we see that the 5-number summary of minimum pressure is very similar to the one of the original variable.

# Print 5-number summary for imputed variable
summary(weather_full_final$min_pressure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   780.4   893.0   932.0   909.3   943.1   971.9

Hence, imputation seems to work well with this and the rest of our eight weather variables. After imputing them, we aggregate observations by month. We extract the mean for the temperature- and wind-related variables, while we compute the sum for the rain covariate following common practice. Finally, we combine the information of the ‘min_pressure’ and ‘max_pressure’ variables into a single one (‘avg_pressure’) by averaging them out.

# Group imputed weather variables and compute monthly averages 
weather_full_final_a <- weather_full_final %>% 
  group_by(weather_station_name, year, month) %>% 
  summarise(avg_temperature_monthly = round(mean(avg_temperature), 2),
            min_temperature_monthly = round(min(min_temperature), 2),
            max_temperature_monthly = round(max(min_temperature), 2),
            total_rain_monthly = round(sum(total_rain), 2),
            avg_wind_speed_monthly = round(mean(avg_wind_speed), 2),
            wind_direction_monthly = round(mean(direction_max_wind_speed), 2),
            avg_pressure_monthly = round(mean((min_pressure + max_pressure)/2), 2))

In the event that there are still doubts regarding the assumptions of the imputation procedure carried out in this section, Section 7.1. goes throughout an alternative strategy involving weights based on probabilities of missingness.

4.4. Describing model covariates

Now that we our variables are free of missing values, we will create a series of visualizations to describe how they vary across station rings as well as through the time window under study.

We will work with the following variables (at the monthly level):

  1. (Average) temperature (ignore maximum and minimum until the next section)

  2. (Total) rain

  3. (Average) wind speed

  4. (Average) direction of maximum wind speed

  5. (Average) pressure

Weather variables across station rings

We start by exploring the relationship between our five weather variables and station rings. Recall that Ring 1 is the closest to, and Ring 6 is the farthest away from, Plaza del Carmen (city centre). We merge weather data with our dataset containing information about air quality monitoring stations and the corresponding closest weather station for each by the latter variable. We do this because we are interested in visualizing weather variables as a function of the ring of pollution stations each corresponding weather station is closest to.

Importantly, we replace missing values for the ‘ring’ column in the newly created dataframe by “Ring 6: Excluded”. The reason is that all missing values corresponded to the weather station in Somosierra since it was not selected as the closest weather station for any of our pollution stations in the computation performed earlier. As such, weather observations taken in Somosierra had no information about the ring and we thus had to assign the last ring to them manually.

We then use boxplots to compare the distributions of our variables in a given year (2019 for instance, as it is the first year of full implementation of Madrid Central). We select only average temperature to avoid redundancies, pivot our data, recode the newly created weather variable to select its levels manually and display the plot.

# Merge weather data with pollution station data by weather pollution station
weather_all <- weather_full_final_a %>% left_join(data_all_stations, 
                                                by = c("weather_station_name" = "closest_weather_station")) %>% 
  select(-c(town_code, station_code, station_type, address, longitude, latitude, distance_km)) 

# Replace all NA values in 'ring' with the level "Ring 6: Excluded" 
weather_all$ring <- fct_explicit_na(weather_all$ring, na_level = "Ring 6: Excluded")

# Create the plot
weather_all %>%
  # Remove minimum and maximum temperature variables for simplicity
  select(-c(min_temperature_monthly, max_temperature_monthly)) %>% 
  # Pivot weather data to long format
  pivot_longer(cols = c("avg_temperature_monthly", "total_rain_monthly", "wind_direction_monthly",
                        "avg_wind_speed_monthly", "avg_pressure_monthly"),
               names_to = "weather_variable", values_to = "values") %>%
  # Transform variable with weather variables as categories into factor and recode 
  mutate(weather_variable = factor(weather_variable),
         weather_variable = case_when(
           weather_variable == "avg_temperature_monthly" ~ "Average Temperature",
           weather_variable == "total_rain_monthly" ~ "Total Rain",
           weather_variable == "wind_direction_monthly" ~ "Average Wind Direction",
           weather_variable == "avg_wind_speed_monthly" ~ "Average Wind Speed",
           weather_variable == "avg_pressure_monthly" ~ "Average Pressure")) %>%
  filter(year == 2019) %>% 
  # Plot the data with boxplots grouped by ring
  ggplot(aes(x = ring, y = values, fill = ring)) +
  geom_boxplot() +
  facet_wrap(weather_variable ~ ., scales = "free_y") +
  labs(x = "Ring", 
       y = "Measurement", 
       title= "Distribution of Weather Variables by Station Ring (2019)", 
       fill = "Station Ring") +
  scale_fill_manual(values = ring_colors) +
  theme_light() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 8),
        plot.title = element_text(size = 12),
        legend.position = "none")

We see that for total rain and average temperature there are no significant differences across the different stations grouped into rings as the medians are quite similar. In contrast, wind-related variables and average pressure seem to be subject to larger differences. Stations in Ring 3 appear to have higher average wind direction (and wind speed but with smaller differences compared to Rings 4 and 5). For pressure, the medians are not that different but the distributions (especially the minimum values) appear to be.

Weather variables across station rings and time (2019)

We saw some differences with data aggregated at the yearly level. However, we want to assess whether there are any interesting patterns in the evolution of our weather variables for a given year - again, 2019 - across station rings.

We perform similar data transformation operations as before, with the difference that now we are actually plotting the mean value of each weather variable for each station ring in each month of 2019. We use lines color-coded by station ring, with each weather variable plotted separately in its own facet.

# Create the plot
weather_all %>%
  # Remove minimum and maximum temperature variables for simplicity
  select(-c(min_temperature_monthly, max_temperature_monthly)) %>% 
  # Pivot the data to long format
  pivot_longer(cols = c("avg_temperature_monthly", "total_rain_monthly", "wind_direction_monthly",
                        "avg_wind_speed_monthly", "avg_pressure_monthly"),
               names_to = "weather_variable", values_to = "values") %>%
  # Transform and recode the weather variable
  mutate(weather_variable = as.factor(weather_variable),
         weather_variable = case_when(
           weather_variable == "avg_temperature_monthly" ~ "Average Temperature",
           weather_variable == "total_rain_monthly" ~ "Total Rain",
           weather_variable == "wind_direction_monthly" ~ "Average Wind Direction",
           weather_variable == "avg_wind_speed_monthly" ~ "Average Wind Speed",
           weather_variable == "avg_pressure_monthly" ~ "Average Pressure")) %>%
  # Filter by year 2019
  filter(year == 2019) %>% 
  # Group by ring and month and calculate the mean value of each weather variable
  group_by(ring, month, weather_variable) %>%
  summarise(mean_value = mean(values)) %>%
  ungroup() %>%
  # Plot the data with lines color-coded by ring
  ggplot(aes(x = month, y = mean_value, color = ring)) +
  geom_line() +
  facet_wrap(weather_variable ~ ., scales = "free_y") +
  labs(x = "Month", 
       y = "Measurement", 
       title= "Evolution of Monthly Averages of Weather Variables by Station Ring (2019)", 
       color = "Ring") +
  scale_color_manual(values = ring_colors) +
  scale_x_continuous(breaks = c(2, 5, 8, 11), 
                     labels = c("Feb", "May", "Aug", "Nov")) +
  theme_light() +
  theme(axis.text.x = element_text(hjust = 1, size = 8),
        plot.title = element_text(size = 12))

For some variables (e.g. temperature and, to a lesser extent, pressure and wind speed), the trends seem to be more or less parallel across station rings. In these cases, adding month and station fixed effects (see Section 5) could potentially be enough since temperature and pressure seem to be varying consistently through time across rings. However, meteorological conditions still change over time for all rings and the other variables appear to be subject to even stronger time-varying changes across rings. Thus, the inclusion of these variables into our models is essential to control for heterogeneity which could hamper our estimates.

5. MODELLING POLLUTION SPILLOVER EFFECTS

With all the necessary data loaded and (almost) properly transformed, we are ready to kick off the modelling stage. We will employ a Difference-in-Differences (DiD) design (Snow 1885) to uncover whether there were any spatial pollution spillovers as a result of the introduction of restrictions in Madrid Central, and if so, how “big” they are and how far out they are significant.

5.2. Creation of model variables

Merging of dataframes into a final one with complete information for the model

Now we perform our last merging of dataframes in order to get all the attributes needed for running the models into ‘did_df’. The dataframe ‘model_pollution_df’ contains information about dates, pollution and stations (location, type, distance from the centre, ring and closest weather station) while ‘weather_full_final_a’ has data on dates, weather stations and weather variables. We can hence join them by the name of the weather station and the date. The former has different names in each dataframe and so we rename it in the first before merging it with the second:

# Rename weather station variable so that it is the same in both datasets
model_pollution_df <- model_pollution_df %>% 
  rename(weather_station_name = closest_weather_station)

# Merge pollution and weather data
did_df <- model_pollution_df %>% 
  left_join(weather_full_final_a, 
            by = c("year", "month", "weather_station_name"))

Pre- vs. post-treatment dummy

With our final dataframe ready, we can now create our main variables for the DiD models. We first need a dummy taking the value of 1 for observations happening after the introduction of Madrid Central measures (i.e. after and including December 2018), and a value of 0 otherwise.

# Create post-treatment dummy (POST = 1 if year is greater than 2018 or if equal to 2018 and the month is December)
did_df <- did_df %>%
  mutate(POST = ifelse(year > 2018 | (year == 2018 & month == 12), 1, 0))

Ring treatment dummies

Next, we create a series of treatment dummies for each ring to include in our model. We mutate four new columns in the dataframe that take on a value of 1 if the corresponding observation is in the treated group for that ring, and 0 otherwise.

# Create treatment dummies for each ring
did_df <- did_df %>%
  mutate(treatment_ring1 = ifelse(ring == "Ring 1: Treated", 1, 0),
         treatment_ring2 = ifelse(ring == "Ring 2: Treated", 1, 0),
         treatment_ring3 = ifelse(ring == "Ring 3: Treated", 1, 0),
         treatment_ring4 = ifelse(ring == "Ring 4: Treated", 1, 0))

Interaction terms between post-treatment and ring treatment dummies

The logical next step within the DiD framework is producing new variables that identify whether a given observation is in the post-treatment period for each treatment group. This is an interaction term that “activates” the treatment for each group only in the period after the LEZ’s entry into force.

# Create interaction terms by multiplying POST by each corresponding treatment ring
did_df <- did_df %>%
  mutate(POST_treat_1 = POST*as.numeric(treatment_ring1),
         POST_treat_2 = POST*as.numeric(treatment_ring2),
         POST_treat_3 = POST*as.numeric(treatment_ring3),
         POST_treat_4 = POST*as.numeric(treatment_ring4))

The resulting variables have a value of 0 for observations in the control group or in the pre-treatment period, and will have the same value as the ‘POST’ variable for observations in the treated group (depending on the ring) in the post-treatment period.

Unique station identifier

Since our units of analysis are pollution monitoring stations, we would like to create a variable (‘station_id’) that uniquely identifies each station in the Community of Madrid air quality network. We simply paste the station’s unique code with the code of the town it belongs to.

# Create 'station_id' by concatenating 'town_code' and 'station_code', separated by a hyphen
did_df$station_id <- as.factor(paste(did_df$town_code, did_df$station_code, sep="-"))

# Print number of different station codes
length(unique(did_df$station_id))
## [1] 40

We see that, after removing stations with different characteristics than the ones included as controls (i.e. Ring 6), we are left with 40 different air quality monitoring stations. We will use these codes for producing clustered standard errors and creating station fixed effects.

5.3. Pre-modelling feature selection

Correlation between temperature controls and the outcome

We are done with feature engineering on the side of the “treatment-related variables.” As for the controls, before we mentioned we would like to choose just one out of all three variables containing information about temperature to avoid multicollinearity.

To be sure we are including the most relevant one to explain pollution, we compute the correlation of each of the three variables with our outcome.

# Calculate correlation coefficients
correlations <- cor(did_df[c("pollution_level", 
                             "avg_temperature_monthly", 
                             "min_temperature_monthly",
                             "max_temperature_monthly")])

# Extract correlation coefficients for outcome 
pollution_correlations <- correlations[1, 2:4]

# Print absolute values of correlation coefficients
print(round(abs(pollution_correlations), 2))
## avg_temperature_monthly min_temperature_monthly max_temperature_monthly 
##                    0.41                    0.32                    0.38

Average temperature has the highest correlation with pollution, over 0.4 in absolute value. Therefore, we remove the other two variables. We also remove variables containing redundant information from our dataset.

# Remove 'min_temperature_monthly' and 'max_temperature_monthly' from the dataframe
did_df <- did_df %>% select(-c(min_temperature_monthly, max_temperature_monthly))

# Remove redundant variables
did_df <- did_df %>% select(-c(town_code, station_code, type))

Correlation among all weather variables with the outcome

Since we now have the final list of our five weather controls, we can check their degree of association with the outcome variable like we did with average temperature as measured by Pearson’s correlation coefficient. Hence, we select the six variables (outcome and five weather controls), compute the correlations and make some adjustments to our correlation plot:

# Select columns containing the variables of interest
cols_of_interest <- c("pollution_level", "avg_temperature_monthly",
                      "total_rain_monthly", "wind_direction_monthly", "avg_wind_speed_monthly", 
                      "avg_pressure_monthly")

# Compute correlations
cc <- cor(did_df[,cols_of_interest])

# Create an empty plot with a larger top margin
par(mar = c(5, 4, 1, 2))

# Customize the correlation plot and add it to the empty plot
corrplot(cc, method="color", type="upper", order="hclust", tl.col="black", 
         tl.srt=20, tl.cex=0.8, tl.offset=0.4, diag=FALSE, addrect = 4, 
         addCoef.col = "black", col = colorRampPalette(c("#FFFFFF", "#7EBCEB", "#08306B"))(200), 
         mar=c(0,0,1,0))

# Add title to the plot 
mtext("Correlation Plot of Weather Controls and Pollution Level", side = 3, line = -0.1, cex = 1.1)

It is clear that correlations among our weather variables are not excessively high, the largest one being the (obviously negative) association between average temperature and total rain measured on a monthly basis.

We also see that, among all weather variables, average temperature and average wind speed seem to be the most highly correlated with pollution. Both have moderate negative linear associations with the outcome variable. Pressure, rain and wind direction are barely associated (linearly) with pollution since their correlation coefficients are close to 0. Still, even if some variables may not be highly correlated with the dependent variable in our models, we will still include them because they may have explanatory power in conjunction with other variables. Furthermore, these covariates may have non-linear relationships with pollution which would go uncaptured by the correlation coefficient (see Section 6.1.H). More importantly, we should be concerned that some of these variables (without concrete knowledge of which) may be responsible for some variability in the dependent variable that must be accounted for to avoid biased estimates in our model.

5.4. Summary of outcome, explanatory and control variables in the model

We create a table summarising the variables in our model, including their names in this code, description and basic summary statistics (mean and standard deviation).

# Define variable names 
var_names <- c("pollution_level", "POST", 
               "treatment_ring1", "treatment_ring2", "treatment_ring3", "treatment_ring4",
               "avg_temperature_monthly", "total_rain_monthly", "wind_direction_monthly", 
               "avg_wind_speed_monthly", "avg_pressure_monthly")

# Define variable descriptions
var_descriptions <- c("Average monthly nitrogen dioxide (NO2) in micrograms per cubic meter (µg/m3)", 
      "Dummy: 1 if observation is after the date of implementation of the LEZ (November 2018) and 0 otherwise", 
      "Dummy: 1 if pollution monitoring station is in Plaza del Carmen, Plaza de España, Escuelas Aguirre or Parque del Retiro, and 0 otherwise",
      "Dummy: 1 if pollution monitoring station is in Castellana, Méndez Alvaro, Cuatro Caminos, Farolillo, Casa de Campo, Plaza Elíptica, Ramon y Cajal, Moratalaz, Plaza Castilla, Vallecas, Arturo Soria or Barrio del Pilar, and 0 otherwise",
      "Dummy: 1 if pollution monitoring station is in Villaverde, Juan Carlos I, Sanchinarro, Tres Olivos, Ensanche de Vallecas, Leganés, Urb. Embajada, Getafe, Barajas Pueblo, El Pardo, Coslada, Alcorcón, Alcobendas or Majadahonda, and 0 otherwise",
      "Dummy: 1 if pollution monitoring station is in Rivas-Vaciamadrid, Fuenlabrada, Móstoles or Torrejón de Ardoz, and 0 otherwise",
      "Monthly average of daily average temperatures in Celsius (C)",
      "Monthly total precipitation in millimeters (mm)",
      "Monthly average direction of daily maximum wind speed in tens of degrees",
      "Monthly average of daily average windspeed in meters per second (m/s)",
      "Monthly average of daily minimum and maximum pressure in hectopascals (hPA)")

# Create dataframe with variables names and descriptions
variable_descriptions_df <- data.frame(variable_name = var_names, variable_description = var_descriptions)

# Convert 'variable_name' to a factor variable with the desired order of levels
variable_descriptions_df$variable_name <- factor(variable_descriptions_df$variable_name, 
                                                 levels = var_names, 
                                                 ordered = TRUE)

# Calculate summary statistics for the chosen variables
summary_df <- data.frame(
  variable_name = var_names,
  mean = round(sapply(did_df[, var_names], mean), 2),
  SD = round(sapply(did_df[, var_names], sd), 2))

# Remove the row names from the summary dataframe
rownames(summary_df) <- NULL

# Merge summary statistics with variable descriptions
result_df <- variable_descriptions_df %>% left_join(summary_df, by = "variable_name")

# Rename columns in the final dataframe
names(result_df) <- c("Variable Name", "Variable Description", "Mean", "Standard Deviation")

# Order the rows of 'result_df' by the order of levels in variable_name
result_df <- result_df[order(variable_descriptions_df$variable_name),]

# Format table 
kable(result_df, format = "html", row.names = FALSE) %>%
  kable_styling(c("striped", "hover", "bordered"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) 
Variable Name Variable Description Mean Standard Deviation
pollution_level Average monthly nitrogen dioxide (NO2) in micrograms per cubic meter (µg/m3) 32.73 14.47
POST Dummy: 1 if observation is after the date of implementation of the LEZ (November 2018) and 0 otherwise 0.40 0.49
treatment_ring1 Dummy: 1 if pollution monitoring station is in Plaza del Carmen, Plaza de España, Escuelas Aguirre or Parque del Retiro, and 0 otherwise 0.10 0.30
treatment_ring2 Dummy: 1 if pollution monitoring station is in Castellana, Méndez Alvaro, Cuatro Caminos, Farolillo, Casa de Campo, Plaza Elíptica, Ramon y Cajal, Moratalaz, Plaza Castilla, Vallecas, Arturo Soria or Barrio del Pilar, and 0 otherwise 0.30 0.46
treatment_ring3 Dummy: 1 if pollution monitoring station is in Villaverde, Juan Carlos I, Sanchinarro, Tres Olivos, Ensanche de Vallecas, Leganés, Urb. Embajada, Getafe, Barajas Pueblo, El Pardo, Coslada, Alcorcón, Alcobendas or Majadahonda, and 0 otherwise 0.35 0.48
treatment_ring4 Dummy: 1 if pollution monitoring station is in Rivas-Vaciamadrid, Fuenlabrada, Móstoles or Torrejón de Ardoz, and 0 otherwise 0.10 0.30
avg_temperature_monthly Monthly average of daily average temperatures in Celsius (C) 15.27 7.19
total_rain_monthly Monthly total precipitation in millimeters (mm) 34.18 31.47
wind_direction_monthly Monthly average direction of daily maximum wind speed in tens of degrees 24.73 8.41
avg_wind_speed_monthly Monthly average of daily average windspeed in meters per second (m/s) 2.64 0.88
avg_pressure_monthly Monthly average of daily minimum and maximum pressure in hectopascals (hPA) 934.87 19.87

5.5. Manual calculation of the average treatment effect (ATT)

While the causal effect of interest is usually extracted by means of regression, we can first exemplify the logic behind DiD by performing the simple calculation below.

We rely on the potential outcome framework. For the sake of simplicity, we assume that only stations inside and immediately outside the LEZ (Plaza del Carmen, Plaza de España, Escuelas Aguirre and Parque del Retiro) are treated (“treatment_ring1==1”) and the rest are not (“treatment_ring1==0”). Recall we also have the time variable indicating whether a given observation was taken before or after the introduction of the policy (“POST==0” and “POST==1”, respectively).

By subtracting the difference in (mean) pollution levels between the treatment (our four stations) and control groups (the remaining stations) before the intervention from the (mean) difference in pollution levels between the treatment and control units after the intervention, we are able to approximate the Average Treatment Effect (ATT). In our case, it is the difference in pollution levels between the treatment group and the control groups, after accounting for any pre-existing differences in pollution levels between both groups.

We display the means for each of the four cases in the table below:

# Mean pollution level for post-treatment and treatment group 1
posttreat <- mean(did_df$pollution_level[did_df$POST==1 & did_df$treatment_ring1==1])

# Mean pollution level for pre-treatment and treatment group 1
pretreat <- mean(did_df$pollution_level[did_df$POST==0 & did_df$treatment_ring1==1])

# Mean pollution level for post-treatment and "control" group (all other stations)
postcontrol <- mean(did_df$pollution_level[did_df$POST==1 & did_df$treatment_ring1==0])

# Mean pollution level for pre-treatment and control group
precontrol <- mean(did_df$pollution_level[did_df$POST==0 & did_df$treatment_ring1==0])

# Calculate the difference in means (DID) between treatment and control groups
did_att <- (posttreat-postcontrol)-(pretreat-precontrol)

# Create a 3x3 table
table_ATT <- matrix(0, nrow = 3, ncol = 3)

# Fill in the table with the differences
table_ATT[1, 1] <- round(pretreat, 2)
table_ATT[1, 2] <- round(posttreat, 2)
table_ATT[1, 3] <- round(posttreat - pretreat, 2)

table_ATT[2, 1] <- round(precontrol, 2)
table_ATT[2, 2] <- round(postcontrol, 2)
table_ATT[2, 3] <- round(postcontrol - precontrol, 2)

table_ATT[3, 1] <- round(pretreat - precontrol, 2)
table_ATT[3, 2] <- round(posttreat - postcontrol, 2)
table_ATT[3, 3] <- round((posttreat - postcontrol) - (pretreat - precontrol), 2)

# Add row and column names
row.names(table_ATT) <- c("TREATMENT", "CONTROL", "DIFFERENCE")
colnames(table_ATT) <- c("PRE", "POST", "DIFFERENCE")

# Print the table
kable(table_ATT, format = "html") %>%
  kable_styling(c("striped", "bordered"), full_width = FALSE)
PRE POST DIFFERENCE
TREATMENT 45.80 33.04 -12.76
CONTROL 34.45 27.88 -6.57
DIFFERENCE 11.35 5.16 -6.19

We obtain an ATT of -6.19 (third row, third column - the result of taking the “double difference”), which reflects the average reduction in pollution as measured by NO2 concentrations for the stations in Plaza del Carmen, Plaza de España, Escuelas Aguirre and Parque del Retiro as a result of the policy.

5.6. Difference-in-Differences Regressions

It is time to run our DiD regressions. In practice, DiD identification strategies are implemented through fixed-effects models. We run five different ones using the plm() R package, each with an increasing number of predictors and thus degree of complexity compared to the previous one.

First of all we transform the month column to a string so that we are later able to remove the 11 levels of that variable from the regression table summarising our five models for the sake of simplicity. We transform the time, treatment and interaction terms into factors.

# Convert month variables to strings
did_df$month <- as.character(did_df$month)

# Transform new variables into factors
did_df <- did_df %>%
  mutate_if(grepl("treatment_ring|POST_treat|POST", names(.)), ~as.factor(.))

The models in this section rely on within-group transformations (i.e. ‘model’ argument set to “within”), which allows us to account for individual-specific unobserved heterogeneity (i.e. unit fixed effects) that may affect the dependent variable (‘pollution_level’) by subtracting the average of each station’s observations from the observations of that given station. This transformation effectively removes time-invariant differences across stations, leaving only the within-individual variation to be explained by the regressors and allowing different intercepts for different stations.

Model 1: first treatment ring without weather controls

We start with the simplest possible model: we regress the pollution level on the time, treatment and interaction dummies assuming only Plaza del Carmen and the three stations closest to it are treated. We include time (in this case, month) fixed effects since our data is at the monthly level and these are thus likely to capture more variability than year fixed effects (see Section 6.1.G). The model automatically incorporates station fixed effects when we set the ‘index’ argument equal to “station_id”.

# Set up the fixed effects model
m1 <- plm(pollution_level ~ POST + treatment_ring1 + POST_treat_1 +
          month,
          model = "within", index = "station_id",
          data = did_df)

summary(m1)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = pollution_level ~ POST + treatment_ring1 + POST_treat_1 + 
##     month, data = did_df, model = "within", index = "station_id")
## 
## Balanced Panel: n = 40, T = 78, N = 3120
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -24.987526  -3.985351  -0.014801   3.700320  33.151361 
## 
## Coefficients:
##                Estimate Std. Error  t-value  Pr(>|t|)    
## POST1          -6.97690    0.26836 -25.9987 < 2.2e-16 ***
## POST_treat_11  -6.19026    0.84395  -7.3348 2.828e-13 ***
## month10        -8.79217    0.60928 -14.4305 < 2.2e-16 ***
## month11        -5.70158    0.60928  -9.3580 < 2.2e-16 ***
## month12        -0.53002    0.60907  -0.8702    0.3842    
## month2         -8.60203    0.58491 -14.7066 < 2.2e-16 ***
## month3        -16.30194    0.58491 -27.8709 < 2.2e-16 ***
## month4        -23.15104    0.58491 -39.5805 < 2.2e-16 ***
## month5        -24.76630    0.58491 -42.3421 < 2.2e-16 ***
## month6        -24.39553    0.58491 -41.7082 < 2.2e-16 ***
## month7        -22.69518    0.60928 -37.2494 < 2.2e-16 ***
## month8        -23.70388    0.60928 -38.9050 < 2.2e-16 ***
## month9        -14.89930    0.60928 -24.4541 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    445050
## Residual Sum of Squares: 146900
## R-Squared:      0.66992
## Adj. R-Squared: 0.66433
## F-statistic: 478.829 on 13 and 3067 DF, p-value: < 2.22e-16

All our variables seem to be significant (except for one level in the ‘month’ variable) and the R-squared is 0.67, meaning 67% of the variability in the pollution level is explained by the variables in our model.

However, we might want to introduce clustered standard errors since residuals may not be evenly distributed across the line of best fit and are thus not “iid” (independent and identically distributed). This could be due to heterogeneity across stations (due to location, closeness to roads or centrality in the region). Hence, to ensure the need for clustering, we perform a Breusch-Pagan test for heteroskedasticity on our model.

# Breusch-Pagan test for heteroskedasticity 
bp <- bptest(m1)
print(bp)
## 
##  studentized Breusch-Pagan test
## 
## data:  m1
## BP = 246.31, df = 14, p-value < 2.2e-16

Since the p-value is clearly below the significance level of 0.05, we reject the null hypothesis of homoskedastic residuals and conclude that there is evidence of heteroskedasticity in the model.

Therefore, we compute a t-test with clustered standard errors on our simplest model using the coeftest() function while specifying the use of heteroskedasticity-consistent (HC) standard errors. After trying different values for ‘type’, we select HC1 which specifies that the standard errors should be calculated using the “sandwich estimator.” This yields valid (consistent) standard errors and more accurate p-values in the presence of heteroskedasticity and/or autocorrelation. Recall that these two violate the assumptions of ordinary least squares (OLS) regression as the treatment is unable to predict the outcome evenly across stations.

It is true that several options (HC0 through HC3) are available for the type of standard errors. For reference, as we move from HC1 to HC2, the standard error increases, so the critical value decreases and the p-value rises. In other words, as we raise the “HC type”, it becomes increasingly more difficult to find significance. Hence, we will stay at HC1 to be conservative with respect to HC0, but not as much in order to allow for significant effects. Still, the choice is not incredibly relevant as we obtain the same results for all “HC types” in terms of which variables are significant.

# Compute t-test with clustered standard errors (HC1)
clustered_test_m1  <- coeftest(m1, vcov = vcovHC(m1, type = "HC2", cluster = "group"))
print(clustered_test_m1)
## 
## t test of coefficients:
## 
##                Estimate Std. Error  t value  Pr(>|t|)    
## POST1          -6.97690    0.43991 -15.8599 < 2.2e-16 ***
## POST_treat_11  -6.19026    1.53531  -4.0319 5.667e-05 ***
## month10        -8.79217    0.41961 -20.9532 < 2.2e-16 ***
## month11        -5.70158    0.27426 -20.7890 < 2.2e-16 ***
## month12        -0.53002    0.21533  -2.4615   0.01389 *  
## month2         -8.60203    0.33125 -25.9683 < 2.2e-16 ***
## month3        -16.30194    0.47108 -34.6053 < 2.2e-16 ***
## month4        -23.15104    0.63093 -36.6934 < 2.2e-16 ***
## month5        -24.76630    0.66914 -37.0124 < 2.2e-16 ***
## month6        -24.39553    0.72300 -33.7423 < 2.2e-16 ***
## month7        -22.69518    0.71482 -31.7495 < 2.2e-16 ***
## month8        -23.70388    0.77106 -30.7420 < 2.2e-16 ***
## month9        -14.89930    0.59572 -25.0106 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our treatment effect (coefficient for ‘POST_treat_1’) for the closest four stations to the city centre (those within a range of around 2.5 kilometres from Plaza del Carmen) is around -6.2, which is the same as the result of our manual calculation of the ATT. This a very clear negative (causal) effect of the implementation of restrictions within Madrid Central on the pollution levels around these four stations.

We also see that all levels of the ‘month’ variable are now significant and have negative coefficients, reflecting the insights introduced in Section 5.1 (Parallel Trends) where aggregate pollution levels showed a negative time trend. In absolute value, the magnitude of this decrease seems to be strongest for the months of May and June and weakest for December as seen in Section 2.4.

We can investigate further how the estimates for Ring 1 vary with time. Instead of using months, we will use years to avoid having an excessive number of time points. For that we use a different package (feols()) to estimate normal OLS with our station and (now) year fixed effects (see Section 6.1.G for more on year fixed effects). We choose the year of implementation, 2018, as reference year (not shown in the plot).

We perform a series of transformations and plot our estimates for the treatment effect of Ring 1 stations for each given year:

# Convert 'year' into a factor and relevel it with the reference level set to '2018'
did_df_feols <- did_df %>%
  mutate(year = relevel(factor(year), ref = '2018'))

# Run `feols()` model
FE_OLS <- feols(pollution_level ~ I(treatment_ring1 == 1)*year | station_id + year,
                cluster = "station_id",
                data = did_df_feols)

# Specify x-axis labels for the years
year_labels <- c("2015", "2016", "2017", "2019", "2020", "2021")

# Create a dataframe with coefficient estimates, standard errors and year labels
coef_df <- data.frame(coef = coef(FE_OLS), se = se(FE_OLS), year = year_labels)

# Convert to factor
coef_df <- coef_df %>% mutate(year = as.factor(year))

# Create the estimates plot 
ggplot(coef_df, aes(x = year, y = coef)) +
  geom_errorbar(aes(ymin = coef - se, ymax = coef + se), width = 0.4) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "blue") +
  labs(x = "Year",
       y = "Estimate (95% Confidence Interval)",
       title = "Dynamic Treatment Effects for Ring 1 Stations by Year") +
  theme_light()

We see that before 2018, the confidence intervals for the treatment effect contained 0. However, from 2019 onwards, these are consistently in negative territory. Moreover, it seems that the (negative) impact of the policy on pollution was actually higher in 2020 compared to the first year of full implementation, thereby suggesting a strengthening of treatment effects in the four Ring 1 stations over time.

Model 2: first treatment ring with weather controls

For Model 2 we still take our four most central stations as the only treated units. Yet, apart from controlling for the month and station through fixed effects, we add our weather covariates. We also remove the treatment dummies from the regression model as the plm() package produces the same results with and without them (i.e. it understands the treatment dummy to be implicitly there).

# Set up the fixed effects model
m2 <- plm(pollution_level ~ POST + POST_treat_1 +
              avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
              wind_direction_monthly + avg_pressure_monthly +
              month,
              model = "within", index = "station_id",
              data = did_df)

summary(m2)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = pollution_level ~ POST + POST_treat_1 + avg_temperature_monthly + 
##     total_rain_monthly + avg_wind_speed_monthly + wind_direction_monthly + 
##     avg_pressure_monthly + month, data = did_df, model = "within", 
##     index = "station_id")
## 
## Balanced Panel: n = 40, T = 78, N = 3120
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -22.519760  -3.759124  -0.078218   3.701868  33.160595 
## 
## Coefficients:
##                            Estimate  Std. Error  t-value  Pr(>|t|)    
## POST1                    -5.9473473   0.2432304 -24.4515 < 2.2e-16 ***
## POST_treat_11            -4.8299489   0.7353164  -6.5685 5.948e-11 ***
## avg_temperature_monthly  -0.1435042   0.0722170  -1.9871 0.0469974 *  
## total_rain_monthly       -0.0513408   0.0044325 -11.5827 < 2.2e-16 ***
## avg_wind_speed_monthly   -4.9349604   0.2085901 -23.6586 < 2.2e-16 ***
## wind_direction_monthly   -0.0750806   0.0227845  -3.2952 0.0009946 ***
## avg_pressure_monthly     -0.0230897   0.0194141  -1.1893 0.2344021    
## month10                  -8.0155215   0.8938668  -8.9672 < 2.2e-16 ***
## month11                  -4.9007961   0.6203175  -7.9005 3.841e-15 ***
## month12                  -2.1510398   0.5438896  -3.9549 7.829e-05 ***
## month2                   -6.8213455   0.5372431 -12.6969 < 2.2e-16 ***
## month3                  -11.9980392   0.6287979 -19.0809 < 2.2e-16 ***
## month4                  -18.4655908   0.7748336 -23.8317 < 2.2e-16 ***
## month5                  -21.3357585   0.9996020 -21.3443 < 2.2e-16 ***
## month6                  -21.0576208   1.2912756 -16.3076 < 2.2e-16 ***
## month7                  -19.7359705   1.5895186 -12.4163 < 2.2e-16 ***
## month8                  -21.4271853   1.4938446 -14.3437 < 2.2e-16 ***
## month9                  -13.9027945   1.1942037 -11.6419 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    445050
## Residual Sum of Squares: 110570
## R-Squared:      0.75155
## Adj. R-Squared: 0.74692
## F-statistic: 514.578 on 18 and 3062 DF, p-value: < 2.22e-16
# T-test with clustered standard errors
clustered_test_m2 <- coeftest(m2, vcov = vcovHC(m2, type = "HC1", cluster = "group"))
print(clustered_test_m2)
## 
## t test of coefficients:
## 
##                            Estimate  Std. Error  t value  Pr(>|t|)    
## POST1                    -5.9473473   0.4577064 -12.9938 < 2.2e-16 ***
## POST_treat_11            -4.8299489   1.5337223  -3.1492  0.001653 ** 
## avg_temperature_monthly  -0.1435042   0.0829049  -1.7310  0.083561 .  
## total_rain_monthly       -0.0513408   0.0056766  -9.0443 < 2.2e-16 ***
## avg_wind_speed_monthly   -4.9349604   0.3855256 -12.8006 < 2.2e-16 ***
## wind_direction_monthly   -0.0750806   0.0246257  -3.0489  0.002317 ** 
## avg_pressure_monthly     -0.0230897   0.0426763  -0.5410  0.588518    
## month10                  -8.0155215   0.9742748  -8.2272 2.802e-16 ***
## month11                  -4.9007961   0.5286663  -9.2701 < 2.2e-16 ***
## month12                  -2.1510398   0.2750942  -7.8193 7.251e-15 ***
## month2                   -6.8213455   0.4013852 -16.9945 < 2.2e-16 ***
## month3                  -11.9980392   0.6871073 -17.4617 < 2.2e-16 ***
## month4                  -18.4655908   0.9147885 -20.1856 < 2.2e-16 ***
## month5                  -21.3357585   1.2068775 -17.6785 < 2.2e-16 ***
## month6                  -21.0576208   1.5158347 -13.8918 < 2.2e-16 ***
## month7                  -19.7359705   1.8265641 -10.8050 < 2.2e-16 ***
## month8                  -21.4271853   1.7942757 -11.9420 < 2.2e-16 ***
## month9                  -13.9027945   1.3480227 -10.3135 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems that all controls are statistically significant at the 0.05 significance level except average pressure and average temperature. Our adjusted R-squared (which penalizes the inclusion of irrelevant variables into the model) has risen to around 74.7%. This means that meteorological information is actually relevant at explaining the variability in pollution levels.

Our treatment effect for the closest four stations to the city centre is now around -4.8 with a p-value of 0.002. That is, the policy has led to a significant decrease of 4.8 µg/m3 in NO2 concentrations inside and within the immediate neighbourhood of Madrid Central after controlling for the effects of meteorological conditions. This is lower than the effect found in Model 1 in absolute value. This reduction means that, by incorporating weather controls, we were able to “strip away” some of the effect on pollution that was actually not caused by the restrictions in Madrid Central but by fluctuating meteorological conditions.

Model 3: first and second treatment rings with weather controls

Model 3 retains the same fixed effects and controls, but we now incorporate the second treatment ring (stations within 2.5 and 10 kilometres from Plaza del Carmen) in addition to the first.

# Set up the fixed effects model
m3 <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 +
            avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
            wind_direction_monthly + avg_pressure_monthly +
            month,
            model = "within", index = "station_id",
            data = did_df)

summary(m3)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = pollution_level ~ POST + POST_treat_1 + POST_treat_2 + 
##     avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
##     wind_direction_monthly + avg_pressure_monthly + month, data = did_df, 
##     model = "within", index = "station_id")
## 
## Balanced Panel: n = 40, T = 78, N = 3120
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -22.506609  -3.815558  -0.084024   3.734732  32.810488 
## 
## Coefficients:
##                            Estimate  Std. Error  t-value  Pr(>|t|)    
## POST1                    -5.5476293   0.2942513 -18.8534 < 2.2e-16 ***
## POST_treat_11            -5.2336318   0.7535959  -6.9449 4.607e-12 ***
## POST_treat_21            -1.1955516   0.4961391  -2.4097 0.0160238 *  
## avg_temperature_monthly  -0.1553601   0.0723279  -2.1480 0.0317922 *  
## total_rain_monthly       -0.0512628   0.0044292 -11.5739 < 2.2e-16 ***
## avg_wind_speed_monthly   -4.9003670   0.2089204 -23.4557 < 2.2e-16 ***
## wind_direction_monthly   -0.0802765   0.0228685  -3.5103 0.0004540 ***
## avg_pressure_monthly     -0.0228754   0.0193990  -1.1792 0.2384096    
## month10                  -7.8904847   0.8946720  -8.8194 < 2.2e-16 ***
## month11                  -4.8495608   0.6201957  -7.8194 7.245e-15 ***
## month12                  -2.1150944   0.5436679  -3.8904 0.0001022 ***
## month2                   -6.8103280   0.5368414 -12.6859 < 2.2e-16 ***
## month3                  -11.9726520   0.6283933 -19.0528 < 2.2e-16 ***
## month4                  -18.4029092   0.7746630 -23.7560 < 2.2e-16 ***
## month5                  -21.2153214   1.0000680 -21.2139 < 2.2e-16 ***
## month6                  -20.8785400   1.2924017 -16.1548 < 2.2e-16 ***
## month7                  -19.5080971   1.5910851 -12.2609 < 2.2e-16 ***
## month8                  -21.2159465   1.4952454 -14.1889 < 2.2e-16 ***
## month9                  -13.7264403   1.1955096 -11.4817 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    445050
## Residual Sum of Squares: 110360
## R-Squared:      0.75202
## Adj. R-Squared: 0.74732
## F-statistic: 488.566 on 19 and 3061 DF, p-value: < 2.22e-16
# T-test with clustered standard errors
clustered_test_m3 <- coeftest(m3, vcov = vcovHC(m3, type = "HC1", cluster = "group"))
print(clustered_test_m3)
## 
## t test of coefficients:
## 
##                            Estimate  Std. Error  t value  Pr(>|t|)    
## POST1                    -5.5476293   0.6195936  -8.9537 < 2.2e-16 ***
## POST_treat_11            -5.2336318   1.5831400  -3.3059 0.0009578 ***
## POST_treat_21            -1.1955516   0.7701380  -1.5524 0.1206732    
## avg_temperature_monthly  -0.1553601   0.0806890  -1.9254 0.0542696 .  
## total_rain_monthly       -0.0512628   0.0057434  -8.9254 < 2.2e-16 ***
## avg_wind_speed_monthly   -4.9003670   0.3867853 -12.6695 < 2.2e-16 ***
## wind_direction_monthly   -0.0802765   0.0246313  -3.2591 0.0011298 ** 
## avg_pressure_monthly     -0.0228754   0.0429907  -0.5321 0.5946946    
## month10                  -7.8904847   0.9652652  -8.1744 4.305e-16 ***
## month11                  -4.8495608   0.5236444  -9.2612 < 2.2e-16 ***
## month12                  -2.1150944   0.2730168  -7.7471 1.269e-14 ***
## month2                   -6.8103280   0.3999415 -17.0283 < 2.2e-16 ***
## month3                  -11.9726520   0.6822960 -17.5476 < 2.2e-16 ***
## month4                  -18.4029092   0.9073398 -20.2823 < 2.2e-16 ***
## month5                  -21.2153214   1.1949235 -17.7545 < 2.2e-16 ***
## month6                  -20.8785400   1.4974873 -13.9424 < 2.2e-16 ***
## month7                  -19.5080971   1.7986847 -10.8458 < 2.2e-16 ***
## month8                  -21.2159465   1.7673268 -12.0045 < 2.2e-16 ***
## month9                  -13.7264403   1.3278991 -10.3370 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The adjusted R-squared is virtually the same. Our treatment effect for stations in Ring 1 is now around -5.2 (p-value = 0.001) and the treatment effect for stations in Ring 2 is about -1.2 (p-value = 0.1). There is an evident decrease in the absolute magnitude of the coefficient for Ring 2 compared to Ring 1, with the effect for the stations in the former ring being insignificant. Hence, the reduction of NO2 in the air was much weaker in the stations farther away from Plaza del Carmen than in those stations in its immediate vicinity (or the centroid station itself). Moreover, the impact of the policy on pollution levels in Ring 1 stations is amplified once we hold the second treatment ring constant. See Section 6.1.D for an alternative specification without this ceteris paribus implication.

Model 4: first three treatment rings with weather controls

We keep the same variables in the model but now we add the third ring of stations, those that lie within 10 and 15 kilometres from the only station inside the LEZ.

# Set up the fixed effects model
m4 <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 +
            avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
            wind_direction_monthly + avg_pressure_monthly +
            month,
            model = "within", index = "station_id",
            data = did_df)

summary(m4)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = pollution_level ~ POST + POST_treat_1 + POST_treat_2 + 
##     POST_treat_3 + avg_temperature_monthly + total_rain_monthly + 
##     avg_wind_speed_monthly + wind_direction_monthly + avg_pressure_monthly + 
##     month, data = did_df, model = "within", index = "station_id")
## 
## Balanced Panel: n = 40, T = 78, N = 3120
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -22.504041  -3.782079  -0.087779   3.754598  32.824383 
## 
## Coefficients:
##                            Estimate  Std. Error  t-value  Pr(>|t|)    
## POST1                    -5.0031488   0.4431765 -11.2893 < 2.2e-16 ***
## POST_treat_11            -5.7844968   0.8246534  -7.0145  2.83e-12 ***
## POST_treat_21            -1.7447812   0.5981740  -2.9168 0.0035616 ** 
## POST_treat_31            -0.9474833   0.5768037  -1.6426 0.1005593    
## avg_temperature_monthly  -0.1578420   0.0723236  -2.1824 0.0291528 *  
## total_rain_monthly       -0.0512021   0.0044281 -11.5630 < 2.2e-16 ***
## avg_wind_speed_monthly   -4.9039605   0.2088739 -23.4781 < 2.2e-16 ***
## wind_direction_monthly   -0.0775976   0.0229203  -3.3855 0.0007193 ***
## avg_pressure_monthly     -0.0229512   0.0193937  -1.1834 0.2367281    
## month10                  -7.8687657   0.8945216  -8.7966 < 2.2e-16 ***
## month11                  -4.8408039   0.6200467  -7.8072  7.97e-15 ***
## month12                  -2.1141712   0.5435174  -3.8898 0.0001025 ***
## month2                   -6.8023153   0.5367147 -12.6740 < 2.2e-16 ***
## month3                  -11.9587836   0.6282758 -19.0343 < 2.2e-16 ***
## month4                  -18.3835960   0.7745375 -23.7349 < 2.2e-16 ***
## month5                  -21.1822229   0.9999938 -21.1824 < 2.2e-16 ***
## month6                  -20.8331133   1.2923393 -16.1205 < 2.2e-16 ***
## month7                  -19.4469611   1.5910793 -12.2225 < 2.2e-16 ***
## month8                  -21.1570753   1.4952603 -14.1494 < 2.2e-16 ***
## month9                  -13.6853587   1.1954397 -11.4480 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    445050
## Residual Sum of Squares: 110270
## R-Squared:      0.75224
## Adj. R-Squared: 0.74746
## F-statistic: 464.53 on 20 and 3060 DF, p-value: < 2.22e-16
# T-test with clustered standard errors
clustered_test_m4 <- coeftest(m4, vcov = vcovHC(m4, type = "HC1", cluster = "group"))
print(clustered_test_m4)
## 
## t test of coefficients:
## 
##                            Estimate  Std. Error  t value  Pr(>|t|)    
## POST1                    -5.0031488   0.9450035  -5.2943 1.279e-07 ***
## POST_treat_11            -5.7844968   1.7413320  -3.3219 0.0009046 ***
## POST_treat_21            -1.7447812   1.0579585  -1.6492 0.0992101 .  
## POST_treat_31            -0.9474833   1.2405338  -0.7638 0.4450628    
## avg_temperature_monthly  -0.1578420   0.0806674  -1.9567 0.0504733 .  
## total_rain_monthly       -0.0512021   0.0058295  -8.7833 < 2.2e-16 ***
## avg_wind_speed_monthly   -4.9039605   0.3847909 -12.7445 < 2.2e-16 ***
## wind_direction_monthly   -0.0775976   0.0264252  -2.9365 0.0033441 ** 
## avg_pressure_monthly     -0.0229512   0.0430273  -0.5334 0.5937879    
## month10                  -7.8687657   0.9646806  -8.1569 4.964e-16 ***
## month11                  -4.8408039   0.5223767  -9.2669 < 2.2e-16 ***
## month12                  -2.1141712   0.2741693  -7.7112 1.674e-14 ***
## month2                   -6.8023153   0.3988911 -17.0531 < 2.2e-16 ***
## month3                  -11.9587836   0.6799791 -17.5870 < 2.2e-16 ***
## month4                  -18.3835960   0.9065672 -20.2782 < 2.2e-16 ***
## month5                  -21.1822229   1.1917552 -17.7740 < 2.2e-16 ***
## month6                  -20.8331133   1.4916507 -13.9665 < 2.2e-16 ***
## month7                  -19.4469611   1.7930370 -10.8458 < 2.2e-16 ***
## month8                  -21.1570753   1.7650425 -11.9867 < 2.2e-16 ***
## month9                  -13.6853587   1.3269278 -10.3136 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, the R-squared has not changed much by adding new stations, which makes sense since it is the addition of weather controls and not the incorporation of new units into the treatment that enhances the explainability of the model. Our treatment effects for Rings 1, 2 and 3 are now: -5.8 (p-value = 0.001), -1.7 (p-value = 0.1) and -0.9 (p-value = 0.4), respectively.

Only the effect of the policy for stations in Ring 1 is significant. The effects are insignificant for those in Rings 2 and 3 at the 0.05 significance level, with this “significance being much lower” in Ring 3 compared to Ring 2. This makes sense as Ring 3 stations are farther away from the “restricted area” than Ring 2 stations. These results show that, if there were any important spillover effects, these would be stronger and more likely to be statistically significant in stations closer to where the policy is actually active.

Model 5: four treatment rings with weather controls

Our last model is the most complete one as it incorporates all controls and fixed effects, as well as all stations aside from those in the control group in four different rings (with the last one comprising stations within 20 and 40 kilometres from Plaza del Carmen).

# Set up the fixed effects model
m5 <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
            avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
            wind_direction_monthly + avg_pressure_monthly + 
            month,
            model = "within", index = c("station_id"),
            data = did_df)

summary(m5)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = pollution_level ~ POST + POST_treat_1 + POST_treat_2 + 
##     POST_treat_3 + POST_treat_4 + avg_temperature_monthly + total_rain_monthly + 
##     avg_wind_speed_monthly + wind_direction_monthly + avg_pressure_monthly + 
##     month, data = did_df, model = "within", index = c("station_id"))
## 
## Balanced Panel: n = 40, T = 78, N = 3120
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -22.503967  -3.784694  -0.079589   3.752970  32.819618 
## 
## Coefficients:
##                            Estimate  Std. Error  t-value  Pr(>|t|)    
## POST1                    -5.1657638   0.5693263  -9.0735 < 2.2e-16 ***
## POST_treat_11            -5.6205582   0.8999958  -6.2451 4.821e-10 ***
## POST_treat_21            -1.5817642   0.6972897  -2.2684 0.0233714 *  
## POST_treat_31            -0.7830113   0.6807329  -1.1502 0.2501318    
## POST_treat_41             0.4086759   0.8979933   0.4551 0.6490703    
## avg_temperature_monthly  -0.1580053   0.0723339  -2.1844 0.0290094 *  
## total_rain_monthly       -0.0512371   0.0044293 -11.5677 < 2.2e-16 ***
## avg_wind_speed_monthly   -4.9034607   0.2089039 -23.4723 < 2.2e-16 ***
## wind_direction_monthly   -0.0781354   0.0229537  -3.4040 0.0006725 ***
## avg_pressure_monthly     -0.0230390   0.0193972  -1.1878 0.2350240    
## month10                  -7.8664352   0.8946522  -8.7927 < 2.2e-16 ***
## month11                  -4.8397316   0.6201315  -7.8044 8.146e-15 ***
## month12                  -2.1132348   0.5435918  -3.8875 0.0001034 ***
## month2                   -6.8025758   0.5367846 -12.6728 < 2.2e-16 ***
## month3                  -11.9585966   0.6283573 -19.0315 < 2.2e-16 ***
## month4                  -18.3824035   0.7746423 -23.7302 < 2.2e-16 ***
## month5                  -21.1811981   1.0001259 -21.1785 < 2.2e-16 ***
## month6                  -20.8315224   1.2925115 -16.1171 < 2.2e-16 ***
## month7                  -19.4455457   1.5912885 -12.2200 < 2.2e-16 ***
## month8                  -21.1560681   1.4954557 -14.1469 < 2.2e-16 ***
## month9                  -13.6836725   1.1956004 -11.4450 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    445050
## Residual Sum of Squares: 110260
## R-Squared:      0.75226
## Adj. R-Squared: 0.7474
## F-statistic: 442.305 on 21 and 3059 DF, p-value: < 2.22e-16
# T-test with clustered standard errors
clustered_test_m5 <- coeftest(m5, vcov = vcovHC(m5, type = "HC1", cluster = "group"))
print(clustered_test_m5)
## 
## t test of coefficients:
## 
##                            Estimate  Std. Error  t value  Pr(>|t|)    
## POST1                    -5.1657638   1.1887487  -4.3455 1.435e-05 ***
## POST_treat_11            -5.6205582   1.8956858  -2.9649  0.003051 ** 
## POST_treat_21            -1.5817642   1.2960655  -1.2204  0.222394    
## POST_treat_31            -0.7830113   1.4334775  -0.5462  0.584946    
## POST_treat_41             0.4086759   1.9585440   0.2087  0.834725    
## avg_temperature_monthly  -0.1580053   0.0806897  -1.9582  0.050299 .  
## total_rain_monthly       -0.0512371   0.0058207  -8.8026 < 2.2e-16 ***
## avg_wind_speed_monthly   -4.9034607   0.3849906 -12.7366 < 2.2e-16 ***
## wind_direction_monthly   -0.0781354   0.0258846  -3.0186  0.002560 ** 
## avg_pressure_monthly     -0.0230390   0.0430191  -0.5356  0.592306    
## month10                  -7.8664352   0.9636218  -8.1634 4.708e-16 ***
## month11                  -4.8397316   0.5216035  -9.2786 < 2.2e-16 ***
## month12                  -2.1132348   0.2742367  -7.7059 1.744e-14 ***
## month2                   -6.8025758   0.3991540 -17.0425 < 2.2e-16 ***
## month3                  -11.9585966   0.6806615 -17.5691 < 2.2e-16 ***
## month4                  -18.3824035   0.9066941 -20.2741 < 2.2e-16 ***
## month5                  -21.1811981   1.1924890 -17.7622 < 2.2e-16 ***
## month6                  -20.8315224   1.4925041 -13.9574 < 2.2e-16 ***
## month7                  -19.4455457   1.7941964 -10.8380 < 2.2e-16 ***
## month8                  -21.1560681   1.7659380 -11.9801 < 2.2e-16 ***
## month9                  -13.6836725   1.3267787 -10.3135 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This last model is equally explanatory as the previous three and the treatment effects for Rings 1, 2, 3 and 4 are around -5.6 (p-value = 0.003), -1.6 (p-value = 0.2), -0.8 (p-value = 0.6) and 0.4 (p-value = 0.8). Again, from Ring 2 onwards the treatment effects lose statistical significance with a decreasing level of significance as our stations are located farther away from the centre. The coefficient and p-value for the treatment group consisting of Ring 4 stations are as expected: close to 0 (even positive!) and quite high, respectively.

We now delve a bit deeper into the interpretation of coefficients other than the interaction term in Model 5. Note that our model outputs do not show an intercept. This is because the ‘within model’ is fitting a separate intercept for each station. We extract these below:

# Extract fixed effects of the "within model" m5
fixef(m5)
##   123-2    14-2   148-4   161-1    45-2    47-2    49-3     5-2    58-4     6-4 
##  90.277  79.740  84.392  79.891  77.656  91.546  99.463  88.942  88.331  86.426 
##   65-14     7-4    74-7   79-11   79-16   79-17   79-18   79-24   79-27   79-35 
##  93.049  85.901  92.868  93.983  88.033  98.605  87.304  73.877  93.845  95.345 
##   79-36   79-38   79-39    79-4   79-40   79-47   79-48   79-49   79-50   79-54 
##  89.713  93.075  90.320  95.650  88.984  87.650  88.722  82.362  91.761  88.414 
##   79-55   79-56   79-57   79-58   79-59   79-60    79-8    80-3     9-1    92-5 
##  98.810 104.255  83.921  68.957  82.969  84.124 105.413  80.783  73.142  82.692

Likewise, the coefficient for ‘POST’ (about -5.2 for Model 5) represents how much the average outcome of the control stations has changed in the post-treatment period. It is significant in all our models, implying pollution has decreased over time in the control stations. This confirms the negative trend observed when investigating parallel trends.

Visualization of treatment effects

We can plot the coefficient estimates for the interaction term (i.e. treatment effect) for each of the four rings and their respective confidence intervals. We extract the coefficient estimates and standard errors and build with them a dataframe, whose contents we plot below:

# Extract coefficient estimates and standard errors
coefficients <- coef(clustered_test_m5)
se <- sqrt(diag(vcovHC(m5, type = "HC1", cluster = "group")))

# Create a dataframe for plotting
df_plot <- data.frame(
  treatment_ring = c("Ring 1 (0 - 2.5km)", "Ring 2 (2.5 - 8km)", "Ring 3 (8 - 15km)", "Ring 4 (15 - 20km)"),
  coefficient = coefficients[grep("POST_treat_", names(coefficients))],
  SE = se[grep("POST_treat_", names(se))])

# Plot the coefficient estimates with confidence intervals
ggplot(df_plot, aes(x = treatment_ring, y = coefficient)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "blue") +
  geom_errorbar(aes(ymin = coefficient - 1.96 * SE,
                    ymax = coefficient + 1.96 * SE),
                width = 0.2) +
  labs(x = "Treatment Ring", 
       y = "Estimate (95% Confidence Interval)",
       title = "Treatment Effect Estimates by Ring") +
  ylim(-10, 5) +
  theme_light()

Unsurprisingly, only the treatment effect for Ring 1 is significantly different from 0 (in this case, negative). The coefficient estimates get closer and closer to 0 as stations farther away from Plaza del Carmen are considered as “treated”, with that of the last ring being marginally positive.

5.7. Exporting model results

Lastly, we export the results of our five models. To do that, we first create a vector of month dummies in order to exclude them from the regression table for ease of visualization and because they do not provide relevant information about the significance of spillovers. We also set the names of the variables for better readability. Next, since we do not want the original model object (e.g. ‘m5’) but rather the model with the correct (clustered) standard errors (e.g. ‘clustered_test_m5’), we create our own function that calculates these for all five models (and also for those used to assess our most complete model’s robustness in the remainder of this document). This is done because inserting our previously created model objects with clustered standard errors directly into the stargazer() function returned an error once more than 3 models were included.

# Create vector of month dummies to be excluded from the regression table
exclude_vars <- paste0("month", 2:12)

# Assign variables for regression table
variable_labels <- c(
  "POST1" = "Post-Treatment",
  "POST_treat_11" = "Post-Treatment*Ring 1",
  "POST_treat_21" = "Post-Treatment*Ring 2",
  "POST_treat_31" = "Post-Treatment*Ring 3",
  "POST_treat_41" = "Post-Treatment*Ring 4",
  "avg_temperature_monthly" = "Average Monthly Temperature",
  "total_rain_monthly" = "Total Monthly Rain",
  "avg_wind_speed_monthly" = "Average Monthly Wind Speed",
  "wind_direction_monthly" = "Average Monthly Wind Direction",
  "avg_pressure_monthly" = "Average Monthly Pressure")

# Define function to calculate clustered standard errors 
se_clustered <- function(x)
  coeftest(x, vcov = vcovHC(x, type = "HC1", cluster = "group"))[, "Std. Error"]

# Create the list of five models to include in the table
models <- list(m1, m2, m3, m4, m5)

# Generate regression table for 5 different models without the month dummies
stargazer(models,
          se = lapply(models, se_clustered),
          title = "Regression Results",
          align = TRUE,
          type = "text",
          dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
          dep.var.labels.include = FALSE,
          column.labels = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"),
          omit = exclude_vars,
          covariate.labels = variable_labels,
          omit.stat=c("f", "ser"))
## 
## Regression Results
## ========================================================================================
##                                  Dependent Variable: Average Monthly NO2 Concentration  
##                                ---------------------------------------------------------
##                                  Model 1     Model 2     Model 3    Model 4    Model 5  
##                                    (1)         (2)         (3)        (4)        (5)    
## ----------------------------------------------------------------------------------------
## Post-Treatment                  -6.977***   -5.947***   -5.548***  -5.003***  -5.166*** 
##                                  (0.440)     (0.458)     (0.620)    (0.945)    (1.189)  
##                                                                                         
## Post-Treatment*Ring 1           -6.190***   -4.830***   -5.234***  -5.784***  -5.621*** 
##                                  (1.533)     (1.534)     (1.583)    (1.741)    (1.896)  
##                                                                                         
## Post-Treatment*Ring 2                                    -1.196     -1.745*     -1.582  
##                                                          (0.770)    (1.058)    (1.296)  
##                                                                                         
## Post-Treatment*Ring 3                                                -0.947     -0.783  
##                                                                     (1.241)    (1.433)  
##                                                                                         
## Post-Treatment*Ring 4                                                           0.409   
##                                                                                (1.959)  
##                                                                                         
## Average Monthly Temperature                  -0.144*     -0.155*    -0.158*    -0.158*  
##                                              (0.083)     (0.081)    (0.081)    (0.081)  
##                                                                                         
## Total Monthly Rain                          -0.051***   -0.051***  -0.051***  -0.051*** 
##                                              (0.006)     (0.006)    (0.006)    (0.006)  
##                                                                                         
## Average Monthly Wind Speed                  -4.935***   -4.900***  -4.904***  -4.903*** 
##                                              (0.386)     (0.387)    (0.385)    (0.385)  
##                                                                                         
## Average Monthly Wind Direction              -0.075***   -0.080***  -0.078***  -0.078*** 
##                                              (0.025)     (0.025)    (0.026)    (0.026)  
##                                                                                         
## Average Monthly Pressure                     -0.023      -0.023      -0.023     -0.023  
##                                              (0.043)     (0.043)    (0.043)    (0.043)  
##                                                                                         
## ----------------------------------------------------------------------------------------
## Observations                      3,120       3,120       3,120      3,120      3,120   
## R2                                0.670       0.752       0.752      0.752      0.752   
## Adjusted R2                       0.664       0.747       0.747      0.747      0.747   
## ========================================================================================
## Note:                                                        *p<0.1; **p<0.05; ***p<0.01

6. ROBUSTNESS CHECKS

The last step in our analysis consists of performing a series of robustness checks. We use these to test the validity and reliability of our results to different assumptions or specifications of the model.

We will start by introducing several changes to the model with the highest number of regressors (Model 5), consisting of exploring a different level of data aggregation and period fixed effects, using different specifications of the treatment groups in the model, employing a different model set-up and adding as well as modifying our controls.

We then run two placebo tests. We use these to test the validity of our causal inference strategy by examining the effect of the intervention on the outcome variable at a time where no effect is expected and by assessing the impact of this same treatment on a theoretically unrelated outcome.

6.1. Modifications to the most complete model

A. Weekly aggregation of model data

The first robustness check consists of testing whether aggregating data at the weekly (rather than monthly) level has any impact on the significance of the policy effect.

We repeat the steps followed in the first sections of this document, except for the difference in the aggregation level for both pollution and weather data.

Pollution data for pollution monitoring stations inside and outside the City of Madrid (2015 - 2022)

We first load, transform and merge datasets with pollution information from inside as well as outside the City of Madrid. We then combine this with information about the pollution monitoring stations. We remove both observations taking place after July 2021 and in Ring 6 stations.

# Aggregate in-City data to weekly level
city_weekly <- all_city_data_clean %>%
  mutate(week = as.numeric(format(as.Date(paste(year, month, day, sep = "-")), "%U"))) %>%
  group_by(town_code, station_code, year, month, week) %>%
  summarise(pollution_level = mean(pollution_level))

# Aggregate out-of-City data to weekly level
com_weekly <- all_com_data_clean %>%
  mutate(week = as.numeric(format(as.Date(paste(year, month, day, sep = "-")), "%U"))) %>%
  group_by(town_code, station_code, year, month, week) %>%
  summarise(pollution_level = mean(pollution_level))

# Bind together observations of both datasets  
full_data_w <- bind_rows(city_weekly, com_weekly)

# Join 'city_weekly' with 'city_stations' by 'town_code' and 'station_code'
city_data_w <- city_weekly %>%
  left_join(city_stations_clean, by = c("town_code", "station_code"))

# Join 'com_weekly' with 'com_stations' by 'town_code' and 'station_code' 
com_data_w <- com_weekly %>%
  left_join(com_stations_clean, by = c("town_code", "station_code")) 

# Combine both dataframes into a single one for the whole Community
full_data_complete_w <- bind_rows(city_data_w, com_data_w) %>% 
  select(-c(station_name, rural_area_type))

# Replace the missing values in 'town_name' by Madrid
full_data_complete_w$town_name <- ifelse(is.na(full_data_complete_w$town_name), "Madrid", full_data_complete_w$town_name)

# Merge data with monthly pollution averages and station data (including rings and nearby weather stations)
model_pollution_df_w <- full_data_complete_w %>%
  left_join(data_all_stations, by = c("town_code", "station_code", "address", "station_type", 
                                      "longitude", "latitude", "distance_km")) %>% 
  ungroup()

# Remove data from July 2021 onwards 
model_pollution_df_w <- model_pollution_df_w %>% 
  filter(year < 2021 | (year == 2021 & month < 7)) %>% 
  # Remove stations in "Ring 6: Excluded"
  filter(ring != "Ring 6: Excluded")

# Remove any levels not present in the dataframe
model_pollution_df_w$ring <- droplevels(model_pollution_df_w$ring)

Weather data (2015 - 2021)

Concerning weather data, we only need to modify the last step of Section 4 (having done imputation) and change the level of aggregation.

# Group imputed weather variables and compute weekly averages 
weather_full_final_w <- weather_full_final %>% 
  mutate(week = as.numeric(format(as.Date(paste(year, month, day, sep = "-")), "%U"))) %>%
  group_by(weather_station_name, year, month, week) %>% 
  summarise(avg_temperature_monthly = round(mean(avg_temperature), 2),
            min_temperature_monthly = round(min(min_temperature), 2),
            max_temperature_monthly = round(max(min_temperature), 2),
            total_rain_monthly = round(sum(total_rain), 2),
            wind_direction_monthly = round(mean(direction_max_wind_speed), 2),
            avg_wind_speed_monthly = round(mean(avg_wind_speed), 2),
            avg_pressure_monthly = round(mean((min_pressure + max_pressure)/2), 2))

Pre-modelling feature engineering

As for the pre-modelling stage, we merge our datasets with pollution and weather information, after which we create the post-treatment, treatment and interaction variables. We transform these into factors and ‘month’ into a character variable. We create the unique identifier for each station and remove irrelevant variables.

# Rename weather station variable so that it is the same in both datasets
model_pollution_df_w <- model_pollution_df_w %>% 
  rename(weather_station_name = closest_weather_station)

# Merge pollution and weather data
did_df_w <- model_pollution_df_w %>% 
  left_join(weather_full_final_w, 
            by = c("year", "month", "week", "weather_station_name"))

# Create post-treatment, treatment and interaction terms
did_df_w <- did_df_w %>% mutate(
                                # Create post-treatment dummy (assuming implementation is in November 2015)
                                POST = ifelse(year > 2018 | (year == 2018 & month == 12), 1, 0),
                                # Create treatment dummy for all 4 rings
                                treatment_ring1 = ifelse(ring == "Ring 1: Treated", 1, 0),
                                treatment_ring2 = ifelse(ring == "Ring 2: Treated", 1, 0),
                                treatment_ring3 = ifelse(ring == "Ring 3: Treated", 1, 0),
                                treatment_ring4 = ifelse(ring == "Ring 4: Treated", 1, 0),
                                # Create interaction terms by multiplying POST and treatment ring
                                POST_treat_1 = POST*as.numeric(treatment_ring1),
                                POST_treat_2 = POST*as.numeric(treatment_ring2),
                                POST_treat_3 = POST*as.numeric(treatment_ring3),
                                POST_treat_4 = POST*as.numeric(treatment_ring4)) 

# Transform new variables into factors
did_df_w <- did_df_w %>%
  mutate_if(grepl("treatment_ring|POST_treat|POST", names(.)), ~as.factor(.))

# Convert month variables to strings
did_df_w$month <- as.character(did_df_w$month)

# Create 'station_id' by concatenating 'town_code' and 'station_code', separated by a hyphen
did_df_w$station_id <- as.factor(paste(did_df_w$town_code, did_df_w$station_code, sep="-"))

# Remove unnecessary variables
did_df_w <- did_df_w %>% select(-c(min_temperature_monthly, max_temperature_monthly, town_code, station_code))

Modelling

Finally we run Model 5 on weekly data. We include all four treatment rings and all five weather covariates.

We still include month fixed effects since these are probably enough to capture time-dependent variability. Adding week fixed effects would amount to having too many variables in the model without a much higher degree of control or explainability.

We compute clustered standard errors and display results directly in the regression table.

# Set up the fixed effects model
m5_w <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
            avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
            wind_direction_monthly + avg_pressure_monthly + 
            month,
            model = "within", index = c("station_id"),
            data = did_df_w)

# T-test with clustered standard errors
clustered_test_m5_w <- coeftest(m5_w, vcov = vcovHC(m5_w, type = "HC1", cluster = "group"))
print(clustered_test_m5_w)
## 
## t test of coefficients:
## 
##                            Estimate  Std. Error  t value  Pr(>|t|)    
## POST1                    -4.9767423   1.1820969  -4.2101 2.566e-05 ***
## POST_treat_11            -5.7751098   1.9291483  -2.9936  0.002761 ** 
## POST_treat_21            -1.6400491   1.2988664  -1.2627  0.206723    
## POST_treat_31            -1.1019324   1.4237000  -0.7740  0.438947    
## POST_treat_41             0.1326552   2.0461106   0.0648  0.948308    
## avg_temperature_monthly  -0.1409170   0.0500996  -2.8127  0.004918 ** 
## total_rain_monthly       -0.1090991   0.0093691 -11.6445 < 2.2e-16 ***
## avg_wind_speed_monthly   -5.1668277   0.3222035 -16.0359 < 2.2e-16 ***
## wind_direction_monthly    0.0097103   0.0135465   0.7168  0.473499    
## avg_pressure_monthly      0.0669665   0.0240300   2.7868  0.005330 ** 
## month10                  -6.6578146   0.5727633 -11.6240 < 2.2e-16 ***
## month11                  -4.0874453   0.3413810 -11.9733 < 2.2e-16 ***
## month12                  -1.2063453   0.2147723  -5.6169 1.977e-08 ***
## month2                   -6.1495431   0.4514727 -13.6211 < 2.2e-16 ***
## month3                  -10.5552064   0.5738735 -18.3929 < 2.2e-16 ***
## month4                  -18.0118064   0.7102636 -25.3593 < 2.2e-16 ***
## month5                  -20.6660071   0.9038294 -22.8649 < 2.2e-16 ***
## month6                  -19.0568242   1.0642778 -17.9059 < 2.2e-16 ***
## month7                  -18.5840295   1.2311677 -15.0946 < 2.2e-16 ***
## month8                  -19.8019059   1.2263895 -16.1465 < 2.2e-16 ***
## month9                  -12.6234460   0.8631197 -14.6254 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Calculate clustered standard errors to export results using `stargazer()`
se_m5_w <- se_clustered(m5_w)

# Export results 
stargazer(m5_w, se = list(se_m5_w),
          title = "Regression Results",
          align = TRUE,
          type = "text",
          dep.var.caption = "Dependent Variable: Average Weekly NO2 Concentration",
          dep.var.labels.include = FALSE,
          column.labels = c("Model 5 (Weekly Aggregation)"),
          omit = exclude_vars,
          covariate.labels = variable_labels)
## 
## Regression Results
## ===================================================================================
##                                Dependent Variable: Average Weekly NO2 Concentration
##                                ----------------------------------------------------
##                                            Model 5 (Weekly Aggregation)            
## -----------------------------------------------------------------------------------
## Post-Treatment                                      -4.977***                      
##                                                      (1.182)                       
##                                                                                    
## Post-Treatment*Ring 1                               -5.775***                      
##                                                      (1.929)                       
##                                                                                    
## Post-Treatment*Ring 2                                 -1.640                       
##                                                      (1.299)                       
##                                                                                    
## Post-Treatment*Ring 3                                 -1.102                       
##                                                      (1.424)                       
##                                                                                    
## Post-Treatment*Ring 4                                 0.133                        
##                                                      (2.046)                       
##                                                                                    
## Average Monthly Temperature                         -0.141***                      
##                                                      (0.050)                       
##                                                                                    
## Total Monthly Rain                                  -0.109***                      
##                                                      (0.009)                       
##                                                                                    
## Average Monthly Wind Speed                          -5.167***                      
##                                                      (0.322)                       
##                                                                                    
## Average Monthly Wind Direction                        0.010                        
##                                                      (0.014)                       
##                                                                                    
## Average Monthly Pressure                             0.067***                      
##                                                      (0.024)                       
##                                                                                    
## -----------------------------------------------------------------------------------
## Observations                                          16,138                       
## R2                                                    0.572                        
## Adjusted R2                                           0.570                        
## F Statistic                               1,021.310*** (df = 21; 16077)            
## ===================================================================================
## Note:                                                   *p<0.1; **p<0.05; ***p<0.01

We see that we now have 16,138 observations, compared to the 3,120 we had when aggregating by month. This could have potentially masked important variability or information, for instance by amplifying measurement error. However, this does not seem to be the case as we obtain essentially the same results in both “aggregation levels”: only average pollution in stations in the first treatment ring seems to be significantly affected by the implementation of Madrid Central (the estimated effect is now around -5.7 µg/m3 compared to -5.6 µg/m3 in the model with monthly observations).

Therefore, we can state the results found in Section 5 are robust to a different level of aggregation.

B. Alternative model settings using two-way fixed effects

Now, rather than testing the robustness of our results in the strict sense, we want to show that changing the model setup (while still using the plm() package) does not change results significantly.

While we still set the ‘model’ argument equal to “within” to specify the use of the fixed effects estimator, the addition of ‘effect = “twoways”’ indicates more explicitly that we want to estimate both individual and time fixed effects. We hence index not only by “station_id” but also by “month”, so that all observations are nested within the unique station and month they belong to.

# Set up the fixed effects model
m5_two <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
            avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly +
            wind_direction_monthly + avg_pressure_monthly,
            model = "within",
            effect = "twoways",
            index = c("station_id", "month"),
            data = did_df)

# T-test with clustered standard errors
clustered_test_m5_two <- coeftest(m5_two, vcov = vcovHC(m5_two, type = "HC1", cluster = "group"))
print(clustered_test_m5_two)
## 
## t test of coefficients:
## 
##                           Estimate Std. Error  t value  Pr(>|t|)    
## POST1                   -5.1657638  1.1866445  -4.3533 1.385e-05 ***
## POST_treat_11           -5.6205582  1.8923304  -2.9702  0.002999 ** 
## POST_treat_21           -1.5817642  1.2937714  -1.2226  0.221575    
## POST_treat_31           -0.7830113  1.4309402  -0.5472  0.584281    
## POST_treat_41            0.4086759  1.9550772   0.2090  0.834436    
## avg_temperature_monthly -0.1580053  0.0805469  -1.9617  0.049893 *  
## total_rain_monthly      -0.0512371  0.0058104  -8.8182 < 2.2e-16 ***
## avg_wind_speed_monthly  -4.9034607  0.3843092 -12.7592 < 2.2e-16 ***
## wind_direction_monthly  -0.0781354  0.0258388  -3.0240  0.002516 ** 
## avg_pressure_monthly    -0.0230390  0.0429430  -0.5365  0.591650    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimates (and significance levels) for each of the four treatment rings are the same as before, but we are now unable to check on the ceteris paribus effect of each month fixed effect as these are no longer included explicitly in the model.

We export our results:

# Calculate clustered standard errors to export results using `stargazer()`
se_m5_two <- se_clustered(m5_two)

# Export results 
stargazer(m5_two, 
          se = list(se_m5_two),
          title = "Regression Results",
          align = TRUE,
          type = "text",
          dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
          dep.var.labels.include = FALSE,
          column.labels = c("Model 5 ('Two-Way' Settings)"),
          covariate.labels = variable_labels)
## 
## Regression Results
## ====================================================================================
##                                Dependent Variable: Average Monthly NO2 Concentration
##                                -----------------------------------------------------
##                                            Model 5 ('Two-Way' Settings)             
## ------------------------------------------------------------------------------------
## Post-Treatment                                       -5.166***                      
##                                                       (1.187)                       
##                                                                                     
## Post-Treatment*Ring 1                                -5.621***                      
##                                                       (1.892)                       
##                                                                                     
## Post-Treatment*Ring 2                                 -1.582                        
##                                                       (1.294)                       
##                                                                                     
## Post-Treatment*Ring 3                                 -0.783                        
##                                                       (1.431)                       
##                                                                                     
## Post-Treatment*Ring 4                                  0.409                        
##                                                       (1.955)                       
##                                                                                     
## Average Monthly Temperature                          -0.158**                       
##                                                       (0.081)                       
##                                                                                     
## Total Monthly Rain                                   -0.051***                      
##                                                       (0.006)                       
##                                                                                     
## Average Monthly Wind Speed                           -4.903***                      
##                                                       (0.384)                       
##                                                                                     
## Average Monthly Wind Direction                       -0.078***                      
##                                                       (0.026)                       
##                                                                                     
## Average Monthly Pressure                              -0.023                        
##                                                       (0.043)                       
##                                                                                     
## ------------------------------------------------------------------------------------
## Observations                                           3,120                        
## R2                                                     0.426                        
## Adjusted R2                                            0.415                        
## F Statistic                                 226.961*** (df = 10; 3059)              
## ====================================================================================
## Note:                                                    *p<0.1; **p<0.05; ***p<0.01

Still, there is an important change compared to the baseline model: we now obtain a considerably lower adjusted R-squared (it has dropped to around 0.40). Hence, this model seems to yield the same results as the initial one albeit with a lower degree of explainability.

C. Alternative specification of treatment group I: non-mutually exclusive rings

Next, we want to ensure that our results are not dependent on the specification of our models by treatment ring. Recall that in Model 5 of Section 5 we created four mutually exclusive treatment variables (and thus interaction terms) so that each would contain a different set of stations in order to test the geographical scope of policy spillovers.

Now we take a slightly different approach and create treatment variables where each treatment ring contains the current ring as well as all other previous rings. For instance, the third treatment ring includes observations in Ring 3, Ring 2 and Ring 1.

In order to do that, we repeat the transformations we used before but now we create our treatment rings following this alternative specification.

# Merge pollution and weather data
did_df_s <- model_pollution_df %>% 
  left_join(weather_full_final_a, 
            by = c("year", "month", "weather_station_name"))

# Create post-treatment dummy (POST = 1 if year is greater than 2018 or if equal to 2018 and the month is December)
did_df_s <- did_df_s %>%
  mutate(POST = ifelse(year > 2018 | (year == 2018 & month == 12), 1, 0))

# Create treatment dummies for each ring with alternative specification
did_df_s <- did_df_s %>%
  mutate(treatment_ring1 = ifelse(ring == "Ring 1: Treated", 1, 0),
         treatment_ring2 = ifelse(ring == "Ring 2: Treated" | ring == "Ring 1: Treated", 1, 0),
         treatment_ring3 = ifelse(ring == "Ring 3: Treated" | ring == "Ring 2: Treated" | ring == "Ring 1: Treated", 1, 0),
         treatment_ring4 = ifelse(ring == "Ring 4: Treated" | ring == "Ring 3: Treated" | ring == "Ring 2: Treated" | ring == "Ring 1: Treated", 1, 0))

# Create interaction terms 
did_df_s <- did_df_s %>%
  mutate(POST_treat_1 = POST*as.numeric(treatment_ring1),
         POST_treat_2 = POST*as.numeric(treatment_ring2),
         POST_treat_3 = POST*as.numeric(treatment_ring3),
         POST_treat_4 = POST*as.numeric(treatment_ring4))

# Transform new variables into factors
did_df_s <- did_df_s %>%
  mutate_if(grepl("treatment_ring|POST_treat|POST", names(.)), ~as.factor(.))

# Create 'station_id' by concatenating 'town_code' and 'station_code', separated by a hyphen
did_df_s$station_id <- as.factor(paste(did_df_s$town_code, did_df_s$station_code, sep="-"))

# Remove variables
did_df_s <- did_df_s %>% select(-c(min_temperature_monthly, max_temperature_monthly, town_code, station_code))

# Convert month variables to strings
did_df_s$month <- as.character(did_df_s$month)

We now run our most complete model:

# Set up the fixed effects model
m5_s <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
            avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
            wind_direction_monthly + avg_pressure_monthly + 
            month,
            model = "within", index = c("station_id"),
            data = did_df_s)

# Calculate clustered standard errors to export results using `stargazer()`
se_m5_s <- se_clustered(m5_s)

# Export results 
stargazer(m5_s, se = list(se_m5_s),
          title = "Regression Results",
          align = TRUE,
          type = "text",
          dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
          dep.var.labels.include = FALSE,
          column.labels = c("Model 5 (Alternative Treatment Grouping)"),
          omit = exclude_vars,
          covariate.labels = variable_labels)
## 
## Regression Results
## ====================================================================================
##                                Dependent Variable: Average Monthly NO2 Concentration
##                                -----------------------------------------------------
##                                      Model 5 (Alternative Treatment Grouping)       
## ------------------------------------------------------------------------------------
## Post-Treatment                                       -5.166***                      
##                                                       (1.189)                       
##                                                                                     
## Post-Treatment*Ring 1                                -4.039***                      
##                                                       (1.550)                       
##                                                                                     
## Post-Treatment*Ring 2                                 -0.799                        
##                                                       (0.916)                       
##                                                                                     
## Post-Treatment*Ring 3                                 -1.192                        
##                                                       (1.749)                       
##                                                                                     
## Post-Treatment*Ring 4                                  0.409                        
##                                                       (1.959)                       
##                                                                                     
## Average Monthly Temperature                           -0.158*                       
##                                                       (0.081)                       
##                                                                                     
## Total Monthly Rain                                   -0.051***                      
##                                                       (0.006)                       
##                                                                                     
## Average Monthly Wind Speed                           -4.903***                      
##                                                       (0.385)                       
##                                                                                     
## Average Monthly Wind Direction                       -0.078***                      
##                                                       (0.026)                       
##                                                                                     
## Average Monthly Pressure                              -0.023                        
##                                                       (0.043)                       
##                                                                                     
## ------------------------------------------------------------------------------------
## Observations                                           3,120                        
## R2                                                     0.752                        
## Adjusted R2                                            0.747                        
## F Statistic                                 442.305*** (df = 21; 3059)              
## ====================================================================================
## Note:                                                    *p<0.1; **p<0.05; ***p<0.01

Our estimates do not seem to be very different from the ones in the original model, although (for instance) the treatment effect for Ring 1 stations is lower in absolute magnitude in this new specification.

We can repeat our plot with the confidence intervals for each (new) treatment ring to more easily assess whether the new specification leads to largely differing results:

# Extract coefficient estimates and standard errors
coefficients <- coef(m5_s)
se <- sqrt(diag(vcovHC(m5_s)))

# Create a dataframe for plotting
df_plot <- data.frame(
  treatment_ring = c("Ring 1 (0 - 2.5km)", "Ring 2 (0 - 8km)", "Ring 3 (0 - 15km)", "Ring 4 (0 - 20km)"),
  coefficient = coefficients[grep("POST_treat_", names(coefficients))],
  SE = se[grep("POST_treat_", names(se))])

# Plot the coefficient estimates with confidence intervals
ggplot(df_plot, aes(x = treatment_ring, y = coefficient)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "blue") +
  geom_errorbar(aes(ymin = coefficient - 1.96 * SE,
                    ymax = coefficient + 1.96 * SE),
                width = 0.2) +
  labs(x = "Treatment Ring", 
       y = "Estimate (95% Confidence Interval)",
       title = "Treatment Effect Estimates by Ring (Alternative Specification)") +
  ylim(-10, 5) +
  theme_light()

Indeed, the estimates are very similar. The main difference is the lower variability in Ring 2 and the fact that the estimated coefficient for Ring 3 is actually more negative (i.e. higher in absolute value) than that for Ring 2. Despite the lack of statistical significance for both, this result raises some questions as to why including stations farther away from the city centre into the treatment group strengthens the treatment effect. This could also be proof that our original specification may lead to more accurate results than the alternative.

D. Alternative specification of treatment group II: subgroup analysis through individual models

Models 3 to 5 effectively capture the treatment effect for each of the four rings while controlling for the other rings included in the model. That is, the effectiveness of the policy for the stations in each ring depends on the inclusion of the other ring(s).

Yet, just like we did for Model 2 and Ring 1 stations, we can try to run different models where we only include one ring as the treatment group (aside from time and unit fixed effects and the weather controls).

We reproduce Model 2 for comparability and we do the same with Rings 2, 3 and 4:

# Set up model for only Ring 1 stations (same as Model 2, or 'm2' in Section 5)
m2_r1 <- plm(pollution_level ~ POST + POST_treat_1 +
          avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
          wind_direction_monthly + avg_pressure_monthly +
          month,
          model = "within", index = "station_id",
          data = did_df)

clustered_test_m2_r1 <- coeftest(m2_r1, vcov = vcovHC(m2_r1, type = "HC1", cluster = "group"))
print(clustered_test_m2_r1)
## 
## t test of coefficients:
## 
##                            Estimate  Std. Error  t value  Pr(>|t|)    
## POST1                    -5.9473473   0.4577064 -12.9938 < 2.2e-16 ***
## POST_treat_11            -4.8299489   1.5337223  -3.1492  0.001653 ** 
## avg_temperature_monthly  -0.1435042   0.0829049  -1.7310  0.083561 .  
## total_rain_monthly       -0.0513408   0.0056766  -9.0443 < 2.2e-16 ***
## avg_wind_speed_monthly   -4.9349604   0.3855256 -12.8006 < 2.2e-16 ***
## wind_direction_monthly   -0.0750806   0.0246257  -3.0489  0.002317 ** 
## avg_pressure_monthly     -0.0230897   0.0426763  -0.5410  0.588518    
## month10                  -8.0155215   0.9742748  -8.2272 2.802e-16 ***
## month11                  -4.9007961   0.5286663  -9.2701 < 2.2e-16 ***
## month12                  -2.1510398   0.2750942  -7.8193 7.251e-15 ***
## month2                   -6.8213455   0.4013852 -16.9945 < 2.2e-16 ***
## month3                  -11.9980392   0.6871073 -17.4617 < 2.2e-16 ***
## month4                  -18.4655908   0.9147885 -20.1856 < 2.2e-16 ***
## month5                  -21.3357585   1.2068775 -17.6785 < 2.2e-16 ***
## month6                  -21.0576208   1.5158347 -13.8918 < 2.2e-16 ***
## month7                  -19.7359705   1.8265641 -10.8050 < 2.2e-16 ***
## month8                  -21.4271853   1.7942757 -11.9420 < 2.2e-16 ***
## month9                  -13.9027945   1.3480227 -10.3135 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Set up model for only Ring 2 stations
m2_r2 <- plm(pollution_level ~ POST + POST_treat_2 +
          avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
          wind_direction_monthly + avg_pressure_monthly +
          month,
          model = "within", index = "station_id",
          data = did_df)
clustered_test_m2_r2 <- coeftest(m2_r2, vcov = vcovHC(m2_r2, type = "HC1", cluster = "group"))
print(clustered_test_m2_r2)
## 
## t test of coefficients:
## 
##                            Estimate  Std. Error  t value  Pr(>|t|)    
## POST1                    -6.2897098   0.6463694  -9.7308 < 2.2e-16 ***
## POST_treat_21            -0.4295935   0.8258850  -0.5202 0.6029887    
## avg_temperature_monthly  -0.1627580   0.0813332  -2.0011 0.0454670 *  
## total_rain_monthly       -0.0521784   0.0057062  -9.1441 < 2.2e-16 ***
## avg_wind_speed_monthly   -5.0202094   0.3971890 -12.6393 < 2.2e-16 ***
## wind_direction_monthly   -0.0768988   0.0233469  -3.2937 0.0009999 ***
## avg_pressure_monthly     -0.0294968   0.0430448  -0.6853 0.4932330    
## month10                  -7.8441129   0.9773790  -8.0257 1.425e-15 ***
## month11                  -4.8329441   0.5264660  -9.1800 < 2.2e-16 ***
## month12                  -2.1404266   0.2777896  -7.7052 1.752e-14 ***
## month2                   -6.7627374   0.4038861 -16.7442 < 2.2e-16 ***
## month3                  -11.8720350   0.6914816 -17.1690 < 2.2e-16 ***
## month4                  -18.2963620   0.9184871 -19.9201 < 2.2e-16 ***
## month5                  -21.0898650   1.2127485 -17.3901 < 2.2e-16 ***
## month6                  -20.7190665   1.5242982 -13.5925 < 2.2e-16 ***
## month7                  -19.3200218   1.8308996 -10.5522 < 2.2e-16 ***
## month8                  -21.0509578   1.7872776 -11.7782 < 2.2e-16 ***
## month9                  -13.6196706   1.3447034 -10.1284 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Set up model for only Ring 3 stations
m2_r3 <- plm(pollution_level ~ POST + POST_treat_3 +
          avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
          wind_direction_monthly + avg_pressure_monthly +
          month,
          model = "within", index = "station_id",
          data = did_df)

clustered_test_m2_r3 <- coeftest(m2_r3, vcov = vcovHC(m2_r3, type = "HC1", cluster = "group"))
print(clustered_test_m2_r3)
## 
## t test of coefficients:
## 
##                           Estimate Std. Error  t value  Pr(>|t|)    
## POST1                    -6.671844   0.578949 -11.5241 < 2.2e-16 ***
## POST_treat_31             0.736277   0.998027   0.7377  0.460733    
## avg_temperature_monthly  -0.159509   0.081911  -1.9473  0.051585 .  
## total_rain_monthly       -0.052151   0.005633  -9.2582 < 2.2e-16 ***
## avg_wind_speed_monthly   -5.009075   0.390329 -12.8329 < 2.2e-16 ***
## wind_direction_monthly   -0.078973   0.025817  -3.0589  0.002241 ** 
## avg_pressure_monthly     -0.028879   0.042978  -0.6720  0.501661    
## month10                  -7.872190   0.984802  -7.9937 1.838e-15 ***
## month11                  -4.844132   0.529705  -9.1450 < 2.2e-16 ***
## month12                  -2.141030   0.280721  -7.6269 3.189e-14 ***
## month2                   -6.773702   0.404481 -16.7467 < 2.2e-16 ***
## month3                  -11.892950   0.694112 -17.1340 < 2.2e-16 ***
## month4                  -18.324168   0.918117 -19.9584 < 2.2e-16 ***
## month5                  -21.133339   1.212587 -17.4283 < 2.2e-16 ***
## month6                  -20.778431   1.527542 -13.6025 < 2.2e-16 ***
## month7                  -19.396870   1.837156 -10.5581 < 2.2e-16 ***
## month8                  -21.123098   1.785349 -11.8314 < 2.2e-16 ***
## month9                  -13.670957   1.348851 -10.1353 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Set up model for only Ring 4 stations
m2_r4 <- plm(pollution_level ~ POST + POST_treat_4 +
          avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
          wind_direction_monthly + avg_pressure_monthly +
          month,
          model = "within", index = "station_id",
          data = did_df)

clustered_test_m2_r4 <- coeftest(m2_r4, vcov = vcovHC(m2_r4, type = "HC1", cluster = "group"))
print(clustered_test_m2_r4)
## 
## t test of coefficients:
## 
##                            Estimate  Std. Error  t value  Pr(>|t|)    
## POST1                    -6.6054301   0.4871396 -13.5596 < 2.2e-16 ***
## POST_treat_41             1.8551707   1.6330799   1.1360  0.256048    
## avg_temperature_monthly  -0.1631823   0.0819349  -1.9916  0.046503 *  
## total_rain_monthly       -0.0522365   0.0056598  -9.2294 < 2.2e-16 ***
## avg_wind_speed_monthly   -5.0174129   0.3921263 -12.7954 < 2.2e-16 ***
## wind_direction_monthly   -0.0767009   0.0235764  -3.2533  0.001153 ** 
## avg_pressure_monthly     -0.0295190   0.0429969  -0.6865  0.492426    
## month10                  -7.8381524   0.9790951  -8.0055 1.673e-15 ***
## month11                  -4.8299127   0.5256390  -9.1887 < 2.2e-16 ***
## month12                  -2.1389652   0.2764164  -7.7382 1.359e-14 ***
## month2                   -6.7625943   0.4045067 -16.7181 < 2.2e-16 ***
## month3                  -11.8711128   0.6952570 -17.0744 < 2.2e-16 ***
## month4                  -18.2925094   0.9211973 -19.8573 < 2.2e-16 ***
## month5                  -21.0857718   1.2155650 -17.3465 < 2.2e-16 ***
## month6                  -20.7137488   1.5276395 -13.5593 < 2.2e-16 ***
## month7                  -19.3123452   1.8377630 -10.5086 < 2.2e-16 ***
## month8                  -21.0428324   1.7883474 -11.7666 < 2.2e-16 ***
## month9                  -13.6132188   1.3493920 -10.0884 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the net effects of Madrid Central on the average air pollution levels in Ring 1, Ring 2, Ring 3 and Ring 4 stations are -4.83, -0.43, 0.74 and 1.85, respectively.

It is useful to compare these coefficients with the estimates we obtained for each ring in Model 5 (i.e. model including all treated stations): -5.6, -1.6, -0.8 and 0.4, respectively. Just like we saw in the complete model, the pollution reduction due to the LEZ is significant for Ring 1 but not for Ring 2 and the rest. However, we now see that for these first two rings, the coefficients are larger (in absolute value) in the model that controls for all stations than those in the models estimating the effect for each ring individually. This could be due to the existence of (negative) spillover effects from the LEZ beyond its immediate proximity. In fact, without clustered standard errors, the policy effect for Ring 2 stations turned out to be significant (and negative) in Models 3, 4 and 5 of our primary regression exercise. However, it is more likely that the fluctuations in coefficient magnitudes are due to the smaller sample size when running each model separately.

For Ring 3 stations, however, the coefficient in the “net effect model” is positive while it was negative in the model with all stations (still insignificant). This suggests that there might be negative spillover effects from the LEZ counteracting positive ones observed when analyzing this group of stations independently. Confounding factors, spatial interactions or (again) simply the sample size could be behind this “incongruence” of the sign of treatment effects between the two models.

Finally, the coefficient on Ring 4 stations when considering them in isolation is higher than the one for those stations in the complete model, with both of them being positive and insignificant. This is the opposite effect as observed for Ring 1 and Ring 2 stations. While this could indicate that the LEZ’s impact on pollution levels is relatively localized in Ring 4, the fact that the change in pollution in these stations shows such low statistical significance renders sample size the most likely driver of this variation.

To close off this section, in order to ensure that the significant effect of the policy on the pollution levels of Ring 1 stations found in both sets of models (“complete” and “individual”) does not exclusively hinge on the impact on the only station inside the LEZ (Plaza del Carmen), we re-run the model on the first treatment ring excluding Plaza del Carmen:

# Merge pollution and weather data
did_df_i <- model_pollution_df %>% 
  left_join(weather_full_final_a, 
            by = c("year", "month", "weather_station_name"))

# Create post-treatment dummy 
did_df_i <- did_df_i %>%
  mutate(POST = ifelse(year > 2018 | (year == 2018 & month == 12), 1, 0))

# Create treatment dummies for each ring, excluding Plaza del Carmen from the first ("station_id = 79-35")
did_df_i <- did_df_i %>%
   mutate(treatment_ring1 = ifelse(ring == "Ring 1: Treated" & !(town_code == 79 & station_code == 35),
                                   1, 0),
         treatment_ring2 = ifelse(ring == "Ring 2: Treated", 1, 0),
         treatment_ring3 = ifelse(ring == "Ring 3: Treated", 1, 0),
         treatment_ring4 = ifelse(ring == "Ring 4: Treated", 1, 0))

# Create interaction terms 
did_df_i <- did_df_i %>%
  mutate(POST_treat_1 = POST*as.numeric(treatment_ring1),
         POST_treat_2 = POST*as.numeric(treatment_ring2),
         POST_treat_3 = POST*as.numeric(treatment_ring3),
         POST_treat_4 = POST*as.numeric(treatment_ring4))

# Transform new variables into factors
did_df_i <- did_df_i %>%
  mutate_if(grepl("treatment_ring|POST_treat|POST", names(.)), ~as.factor(.))

# Create 'station_id' by concatenating 'town_code' and 'station_code', separated by a hyphen
did_df_i$station_id <- as.factor(paste(did_df_i$town_code, did_df_i$station_code, sep="-"))

# Remove variables
did_df_i <- did_df_i %>% select(-c(min_temperature_monthly, max_temperature_monthly, town_code, station_code))

# Convert month variables to strings
did_df_i$month <- as.character(did_df_i$month)

# Set up model for only Ring 1 stations, excluding Plaza del Carmen
m2_r1_pc <- plm(pollution_level ~ POST + POST_treat_1 +
          avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
          wind_direction_monthly + avg_pressure_monthly +
          month,
          model = "within", index = "station_id",
          data = did_df_i)

clustered_test_m2_r1_pc <- coeftest(m2_r1_pc, vcov = vcovHC(m2_r1_pc, type = "HC1", cluster = "group"))
print(clustered_test_m2_r1_pc)
## 
## t test of coefficients:
## 
##                            Estimate  Std. Error  t value  Pr(>|t|)    
## POST1                    -6.1168900   0.4695665 -13.0267 < 2.2e-16 ***
## POST_treat_11            -4.1258309   1.9179302  -2.1512  0.031539 *  
## avg_temperature_monthly  -0.1487349   0.0825636  -1.8015  0.071729 .  
## total_rain_monthly       -0.0516428   0.0056438  -9.1504 < 2.2e-16 ***
## avg_wind_speed_monthly   -4.9690367   0.3869532 -12.8414 < 2.2e-16 ***
## wind_direction_monthly   -0.0750636   0.0243207  -3.0864  0.002044 ** 
## avg_pressure_monthly     -0.0253518   0.0427984  -0.5924  0.553658    
## month10                  -7.9713979   0.9764232  -8.1639 4.689e-16 ***
## month11                  -4.8835479   0.5269654  -9.2673 < 2.2e-16 ***
## month12                  -2.1518432   0.2760787  -7.7943 8.805e-15 ***
## month2                   -6.8022806   0.4025034 -16.8999 < 2.2e-16 ***
## month3                  -11.9572639   0.6883657 -17.3705 < 2.2e-16 ***
## month4                  -18.4144112   0.9132709 -20.1631 < 2.2e-16 ***
## month5                  -21.2650737   1.2067274 -17.6221 < 2.2e-16 ***
## month6                  -20.9619615   1.5166982 -13.8208 < 2.2e-16 ***
## month7                  -19.6194279   1.8310187 -10.7150 < 2.2e-16 ***
## month8                  -21.3224146   1.7949011 -11.8794 < 2.2e-16 ***
## month9                  -13.8261309   1.3462614 -10.2700 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The treatment effect is still significant for Ring 1 stations excluding Plaza del Carmen (Plaza de España, Escuelas Aguirre and Parque del Retiro), thereby confirming the robustness of the spatially-bound spillover effect found above. The estimated treatment effect is now smaller in absolute value because the one found in the individual model above for Ring 1 stations encompassed the effect on Plaza del Carmen and thus captured some of the variability (i.e. decrease) in pollution in the policy centroid.

We summarise all five models below, with the second column indicating the results of the last model (Ring 1 stations except Plaza del Carmen):

# Create the list of four models to include in the regression table
models_r <- list(m2_r1, m2_r1_pc, m2_r2, m2_r3, m2_r4)

# Generate regression table for 5 different models 
stargazer(models_r,
          se = lapply(models_r, se_clustered),
          title = "Regression Results",
          align = TRUE,
          type = "text",
          dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
          dep.var.labels.include = FALSE,
          column.labels = c("Ring 1", "Ring 1 w/o P.C.", "Ring 2", "Ring 3", "Ring 4"),
          omit = exclude_vars,
          covariate.labels = variable_labels)
## 
## Regression Results
## ==========================================================================================
##                                   Dependent Variable: Average Monthly NO2 Concentration   
##                                -----------------------------------------------------------
##                                  Ring 1   Ring 1 w/o P.C.   Ring 2     Ring 3     Ring 4  
##                                   (1)           (2)          (3)        (4)        (5)    
## ------------------------------------------------------------------------------------------
## Post-Treatment                 -5.947***     -6.117***    -6.290***  -6.672***  -6.605*** 
##                                 (0.458)       (0.470)      (0.646)    (0.579)    (0.487)  
##                                                                                           
## Post-Treatment*Ring 1          -4.830***     -4.126**                                     
##                                 (1.534)       (1.918)                                     
##                                                                                           
## Post-Treatment*Ring 2                                       -0.430                        
##                                                            (0.826)                        
##                                                                                           
## Post-Treatment*Ring 3                                                  0.736              
##                                                                       (0.998)             
##                                                                                           
## Post-Treatment*Ring 4                                                             1.855   
##                                                                                  (1.633)  
##                                                                                           
## Average Monthly Temperature     -0.144*       -0.149*      -0.163**   -0.160*    -0.163** 
##                                 (0.083)       (0.083)      (0.081)    (0.082)    (0.082)  
##                                                                                           
## Total Monthly Rain             -0.051***     -0.052***    -0.052***  -0.052***  -0.052*** 
##                                 (0.006)       (0.006)      (0.006)    (0.006)    (0.006)  
##                                                                                           
## Average Monthly Wind Speed     -4.935***     -4.969***    -5.020***  -5.009***  -5.017*** 
##                                 (0.386)       (0.387)      (0.397)    (0.390)    (0.392)  
##                                                                                           
## Average Monthly Wind Direction -0.075***     -0.075***    -0.077***  -0.079***  -0.077*** 
##                                 (0.025)       (0.024)      (0.023)    (0.026)    (0.024)  
##                                                                                           
## Average Monthly Pressure         -0.023       -0.025        -0.029     -0.029     -0.030  
##                                 (0.043)       (0.043)      (0.043)    (0.043)    (0.043)  
##                                                                                           
## ------------------------------------------------------------------------------------------
## Observations                     3,120         3,120        3,120      3,120      3,120   
## R2                               0.752         0.750        0.748      0.748      0.749   
## Adjusted R2                      0.747         0.745        0.743      0.744      0.744   
## F Statistic (df = 18; 3062)    514.578***   510.393***    505.235*** 505.611*** 506.457***
## ==========================================================================================
## Note:                                                          *p<0.1; **p<0.05; ***p<0.01

We can also re-run Model 5 (“complete”) on the four treatment rings, with the first ring excluding Plaza del Carmen to assess to what extent the significance of our results in Model 5 depends upon the presence of the policy centroid.

# Set up model for all rings, with Ring 1 excluding Plaza del Carmen
m5_pc <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
            avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
            wind_direction_monthly + avg_pressure_monthly + 
            month,
            model = "within", index = c("station_id"),
            data = did_df_i)

# T-test with clustered standard errors
clustered_test_m5_pc <- coeftest(m5_pc, vcov = vcovHC(m5_pc, type = "HC1", cluster = "group"))
print(clustered_test_m5_pc)
## 
## t test of coefficients:
## 
##                           Estimate Std. Error  t value  Pr(>|t|)    
## POST1                    -6.184982   1.374174  -4.5009 7.020e-06 ***
## POST_treat_11            -4.058568   2.343204  -1.7321  0.083364 .  
## POST_treat_21            -0.550210   1.493939  -0.3683  0.712679    
## POST_treat_31             0.246594   1.610832   0.1531  0.878341    
## POST_treat_41             1.437687   2.090946   0.6876  0.491771    
## avg_temperature_monthly  -0.159357   0.081034  -1.9666  0.049325 *  
## total_rain_monthly       -0.051681   0.005659  -9.1326 < 2.2e-16 ***
## avg_wind_speed_monthly   -4.941223   0.387999 -12.7351 < 2.2e-16 ***
## wind_direction_monthly   -0.080064   0.025421  -3.1496  0.001651 ** 
## avg_pressure_monthly     -0.025496   0.043082  -0.5918  0.554032    
## month10                  -7.858674   0.965462  -8.1398 5.699e-16 ***
## month11                  -4.836986   0.521565  -9.2740 < 2.2e-16 ***
## month12                  -2.120219   0.276202  -7.6763 2.187e-14 ***
## month2                   -6.792014   0.400510 -16.9584 < 2.2e-16 ***
## month3                  -11.932414   0.685254 -17.4131 < 2.2e-16 ***
## month4                  -18.355180   0.908584 -20.2020 < 2.2e-16 ***
## month5                  -21.156826   1.199360 -17.6401 < 2.2e-16 ***
## month6                  -20.801995   1.503309 -13.8375 < 2.2e-16 ***
## month7                  -19.416795   1.807609 -10.7417 < 2.2e-16 ***
## month8                  -21.134841   1.775233 -11.9054 < 2.2e-16 ***
## month9                  -13.669269   1.331756 -10.2641 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Unsurprisingly, the coefficients do not vary much but the treatment effect in Ring 1 (three stations closest to Plaza del Carmen) is now only significant at the 0.1 significance level. Moreover, the positive spillovers seem to be higher in fourth-ring stations when Plaza del Carmen is excluded. This means that, although the policy clearly has beneficial effects in terms of pollution reduction in the vicinity of the LEZ, the decrease is much stronger inside than immediately outside of it.

E. Additional controls I: model with station type control

Again, more than a robustness check, we want to make sure our unit (station) fixed effects are capturing the variability in pollution levels that arises due to station-to-station differences. Perhaps the type of station contributes significantly to this variation so that the inability of fixed effects to cover it would yield biased ATT estimates.

Recall that there are five types of stations in our model:

did_df %>% distinct(station_type) %>% pull()
## [1] "Urbana trafico"     "Urbana fondo"       "Suburbana (ciudad)"
## [4] "Suburbana fondo"    "Urbana industrial"

We re-run Model 5 while controlling for the type of station:

# Set up the fixed effects model
m5_st <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
            avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
            wind_direction_monthly + avg_pressure_monthly + 
            month + as.factor(station_type),
            model = "within", index = c("station_id"),
            data = did_df)

# Calculate clustered standard errors to export results using `stargazer()`
se_m5_st <- se_clustered(m5_st)

# Export the results 
stargazer(m5_st, 
          se = list(se_m5_st),
          title = "Regression Results",
          align = TRUE,
          type = "text",
          dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
          dep.var.labels.include = FALSE,
          column.labels = c("Model 5 (Station Type)"),
          omit = exclude_vars,
          covariate.labels = variable_labels)
## 
## Regression Results
## ====================================================================================
##                                Dependent Variable: Average Monthly NO2 Concentration
##                                -----------------------------------------------------
##                                               Model 5 (Station Type)                
## ------------------------------------------------------------------------------------
## Post-Treatment                                       -5.166***                      
##                                                       (1.189)                       
##                                                                                     
## Post-Treatment*Ring 1                                -5.621***                      
##                                                       (1.896)                       
##                                                                                     
## Post-Treatment*Ring 2                                 -1.582                        
##                                                       (1.296)                       
##                                                                                     
## Post-Treatment*Ring 3                                 -0.783                        
##                                                       (1.433)                       
##                                                                                     
## Post-Treatment*Ring 4                                  0.409                        
##                                                       (1.959)                       
##                                                                                     
## Average Monthly Temperature                           -0.158*                       
##                                                       (0.081)                       
##                                                                                     
## Total Monthly Rain                                   -0.051***                      
##                                                       (0.006)                       
##                                                                                     
## Average Monthly Wind Speed                           -4.903***                      
##                                                       (0.385)                       
##                                                                                     
## Average Monthly Wind Direction                       -0.078***                      
##                                                       (0.026)                       
##                                                                                     
## Average Monthly Pressure                              -0.023                        
##                                                       (0.043)                       
##                                                                                     
## ------------------------------------------------------------------------------------
## Observations                                           3,120                        
## R2                                                     0.752                        
## Adjusted R2                                            0.747                        
## F Statistic                                 442.305*** (df = 21; 3059)              
## ====================================================================================
## Note:                                                    *p<0.1; **p<0.05; ***p<0.01

Fortunately for the sake of our chosen model, the estimated coefficients, standard errors and significance levels for all our variables are identical to those in the model without the station type control (Section 5). Note that, by indexing by station, we are already accounting for differences related to the type of station (station location, primarily). This renders the station type control redundant and thus does not even appear in the model.

F. Additional controls II: model with public works control

Between April and November of 2018, the City Council undertook the comprehensive remodelling of Gran Via. Since this station is the main avenue crossing the LEZ, we want to know if controlling for it (and thus for one potential confounder as seen in our DAG above), makes any difference for our estimates.

Once again, we load our data and create our treatment dummies like we did before.

# Merge pollution and weather data
did_df_gv <- model_pollution_df %>% 
  left_join(weather_full_final_a, 
            by = c("year", "month", "weather_station_name"))

# Create post-treatment dummy (POST = 1 if year is greater than 2018 or if equal to 2018 and the month is December)
did_df_gv <- did_df_gv %>%
  mutate(POST = ifelse(year > 2018 | (year == 2018 & month == 12), 1, 0))

# Create treatment dummies for each ring
did_df_gv <- did_df_gv %>%
  mutate(treatment_ring1 = ifelse(ring == "Ring 1: Treated", 1, 0),
         treatment_ring2 = ifelse(ring == "Ring 2: Treated", 1, 0),
         treatment_ring3 = ifelse(ring == "Ring 3: Treated", 1, 0),
         treatment_ring4 = ifelse(ring == "Ring 4: Treated", 1, 0))

# Create interaction terms by multiplying POST and each corresponding treatment ring
did_df_gv <- did_df_gv %>%
  mutate(POST_treat_1 = POST*as.numeric(treatment_ring1),
         POST_treat_2 = POST*as.numeric(treatment_ring2),
         POST_treat_3 = POST*as.numeric(treatment_ring3),
         POST_treat_4 = POST*as.numeric(treatment_ring4))

# Transform new variables into factors
did_df_gv <- did_df_gv %>%
  mutate_if(grepl("treatment_ring|POST_treat|POST", names(.)), ~as.factor(.))

# Create 'station_id' by concatenating 'town_code' and 'station_code', separated by a hyphen
did_df_gv$station_id <- as.factor(paste(did_df_gv$town_code, did_df_gv$station_code, sep="-"))

# Remove variables
did_df_gv <- did_df_gv %>% select(-c(min_temperature_monthly, max_temperature_monthly, town_code, station_code))

Then, after transforming our “month” variable to integer values to allow for more reliable computations, we create a dummy for the remodelling of Gran Via which takes a value of 1 for observations between April and November of 2018 (both included), and 0 otherwise.

# Convert month variable to integers and create dummy for Gran Via public works
did_df_gv <- did_df_gv %>%
  mutate(month = as.integer(month)) %>%
  mutate(GRAN_VIA = as.factor(ifelse(year == 2018 & month >= 4 & month <= 11, 1, 0)))

# Convert month variables to strings before modelling
did_df_gv$month <- as.character(did_df_gv$month)

We run our model:

# Set up the fixed effects model with Gran Via control
m5_gv <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
            avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
            wind_direction_monthly + avg_pressure_monthly + 
            month + as.factor(GRAN_VIA),
            model = "within", index = c("station_id"),
            data = did_df_gv)

# Specify variable names
variable_labels_gv <- c(
  "POST1" = "Post-Treatment",
  "POST_treat_11" = "Post-Treatment*Ring 1",
  "POST_treat_21" = "Post-Treatment*Ring 2",
  "POST_treat_31" = "Post-Treatment*Ring 3",
  "POST_treat_41" = "Post-Treatment*Ring 4",
  "avg_temperature_monthly" = "Average Monthly Temperature",
  "total_rain_monthly" = "Total Monthly Rain",
  "avg_wind_speed_monthly" = "Average Monthly Wind Speed",
  "wind_direction_monthly" = "Average Monthly Wind Direction",
  "avg_pressure_monthly" = "Average Monthly Pressure",
  "as.factor(GRAN_VIA)1" = "Gran Via")

# Calculate clustered standard errors to export results using `stargazer()`
se_m5_gv <- se_clustered(m5_gv)

# Export results 
stargazer(m5_gv, 
          se = list(se_m5_gv),
          title = "Regression Results",
          align = TRUE,
          type = "text",
          dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
          dep.var.labels.include = FALSE,
          column.labels = c("Model 5 (Gran Via Public Works)"),
          omit = exclude_vars,
          covariate.labels = variable_labels_gv)
## 
## Regression Results
## ====================================================================================
##                                Dependent Variable: Average Monthly NO2 Concentration
##                                -----------------------------------------------------
##                                           Model 5 (Gran Via Public Works)           
## ------------------------------------------------------------------------------------
## Post-Treatment                                       -5.701***                      
##                                                       (1.226)                       
##                                                                                     
## Post-Treatment*Ring 1                                -5.573***                      
##                                                       (1.906)                       
##                                                                                     
## Post-Treatment*Ring 2                                 -1.600                        
##                                                       (1.308)                       
##                                                                                     
## Post-Treatment*Ring 3                                 -0.759                        
##                                                       (1.445)                       
##                                                                                     
## Post-Treatment*Ring 4                                  0.434                        
##                                                       (1.968)                       
##                                                                                     
## Average Monthly Temperature                          -0.197**                       
##                                                       (0.080)                       
##                                                                                     
## Total Monthly Rain                                   -0.050***                      
##                                                       (0.005)                       
##                                                                                     
## Average Monthly Wind Speed                           -5.049***                      
##                                                       (0.376)                       
##                                                                                     
## Average Monthly Wind Direction                       -0.093***                      
##                                                       (0.026)                       
##                                                                                     
## Average Monthly Pressure                              -0.031                        
##                                                       (0.043)                       
##                                                                                     
## Gran Via                                             -3.592***                      
##                                                       (0.412)                       
##                                                                                     
## ------------------------------------------------------------------------------------
## Observations                                           3,120                        
## R2                                                     0.759                        
## Adjusted R2                                            0.755                        
## F Statistic                                 438.723*** (df = 22; 3058)              
## ====================================================================================
## Note:                                                    *p<0.1; **p<0.05; ***p<0.01

The estimates change slightly compared to the original model but not significantly. However, we see the variable ‘GRAN_VIA’ (or ‘Gran Via Remodelling’ in the regression table) is actually significant. The presence of construction works in Gran Via during the 8 months prior to the introduction of the LEZ had an average effect on the pollution level of around -3.6 µg/m3 (compared to its absence), holding all other variables in the model constant. This means that there is a significant overall decrease in average pollution in Madrid associated with this particular instance of public works.

Yet, this does not say anything about the pollution level in the first ring (which includes Gran Via street), since we cannot interact this dummy with ‘POST_treat_1’ as there is no time overlap between the implementation of the LEZ and the presence of construction work.

Hence, we refrain from including this variable in our main models since the estimates are basically the same and the combination of unit fixed effects (for Plaza del Carmen in this case) with month fixed effects should take care of (a lot of) this variability anyway.

G. Model with year fixed effects

The next check entails including year, rather than month, fixed effects in our model. In theory, month fixed effects should perform the same function as year fixed effects by capturing time-related variability in pollution levels affecting all stations equally, yet in a more nuanced way.

Hence we simply run Model 5 again but control for the year rather than the month:

# Convert year variables to strings
did_df$year <- as.character(did_df$year)

# Set up the fixed effects model
m5_y <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
            avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
            wind_direction_monthly + avg_pressure_monthly + 
            year,
            model = "within", index = c("station_id"),
            data = did_df)

# Specify variable names
variable_labels_y <- c(
  "POST1" = "Post-Treatment",
  "POST_treat_11" = "Post-Treatment*Ring 1",
  "POST_treat_21" = "Post-Treatment*Ring 2",
  "POST_treat_31" = "Post-Treatment*Ring 3",
  "POST_treat_41" = "Post-Treatment*Ring 4",
  "avg_temperature_monthly" = "Average Monthly Temperature",
  "total_rain_monthly" = "Total Monthly Rain",
  "avg_wind_speed_monthly" = "Average Monthly Wind Speed",
  "wind_direction_monthly" = "Average Monthly Wind Direction",
  "avg_pressure_monthly" = "Average Monthly Pressure",
  "year2016" = "2016",
  "year2017" = "2017",
  "year2018" = "2018",
  "year2019" = "2019",
  "year2020" = "2020",
  "year2021" = "2021")

# Calculate clustered standard errors to export results using `stargazer()`
se_m5_y <- se_clustered(m5_y)

# Export results 
stargazer(m5_y, 
          se = list(se_m5_y),
          title = "Regression Results",
          align = TRUE,
          type = "text",
          dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
          dep.var.labels.include = FALSE,
          column.labels = c("Model 5 (Year Fixed Effects)"),
          covariate.labels = variable_labels_y)
## 
## Regression Results
## ====================================================================================
##                                Dependent Variable: Average Monthly NO2 Concentration
##                                -----------------------------------------------------
##                                            Model 5 (Year Fixed Effects)             
## ------------------------------------------------------------------------------------
## Post-Treatment                                        -0.449                        
##                                                       (1.582)                       
##                                                                                     
## Post-Treatment*Ring 1                                -5.092**                       
##                                                       (2.026)                       
##                                                                                     
## Post-Treatment*Ring 2                                 -1.602                        
##                                                       (1.478)                       
##                                                                                     
## Post-Treatment*Ring 3                                 -1.110                        
##                                                       (1.589)                       
##                                                                                     
## Post-Treatment*Ring 4                                  0.268                        
##                                                       (2.203)                       
##                                                                                     
## Average Monthly Temperature                          -1.036***                      
##                                                       (0.031)                       
##                                                                                     
## Total Monthly Rain                                   -0.059***                      
##                                                       (0.007)                       
##                                                                                     
## Average Monthly Wind Speed                           -7.345***                      
##                                                       (0.464)                       
##                                                                                     
## Average Monthly Wind Direction                         0.013                        
##                                                       (0.024)                       
##                                                                                     
## Average Monthly Pressure                               0.019                        
##                                                       (0.051)                       
##                                                                                     
## 2016                                                 -1.096***                      
##                                                       (0.420)                       
##                                                                                     
## 2017                                                   0.280                        
##                                                       (0.425)                       
##                                                                                     
## 2018                                                 -3.428***                      
##                                                       (0.560)                       
##                                                                                     
## 2019                                                  -1.688                        
##                                                       (1.229)                       
##                                                                                     
## 2020                                                 -8.832***                      
##                                                       (1.043)                       
##                                                                                     
## 2021                                                 -9.645***                      
##                                                       (1.220)                       
##                                                                                     
## ------------------------------------------------------------------------------------
## Observations                                           3,120                        
## R2                                                     0.684                        
## Adjusted R2                                            0.678                        
## F Statistic                                 413.750*** (df = 16; 3064)              
## ====================================================================================
## Note:                                                    *p<0.1; **p<0.05; ***p<0.01

We see that the coefficient estimates for the interaction terms are quite similar to the model with month fixed effects. Interestingly, the ‘POST’ variable loses significance and its coefficient changes from around -5 to -0.5. This means that this model does not find a significant change in the average pollution level of stations in the control group in the post-treatment period unlike the model with month fixed effects.

There could be several explanations for this. Most clearly, since only month fixed effects capture seasonal variation in pollution levels, the significant and larger estimate for the model including them suggests that there is actually a significant difference in pollution levels throughout the year which this new model misses.

Another important difference with respect to the month-fixed effects model is that the (adjusted) R-squared has dropped from around 0.75 to 0.67. Hence, month fixed effects do seem to account for more variability in the outcome than year fixed effects.

H. Model with quadratic specification of the weather covariates

Finally, to address potential non-linearities in the effect of the weather covariates on the pollution level, we include their squared terms in our regression for Model 5:

# Set up the fixed effects model with quadratic weather controls
m5_quad <- plm(pollution_level ~ POST + POST_treat_1 + POST_treat_2 + POST_treat_3 + POST_treat_4 +
            avg_temperature_monthly + I(avg_temperature_monthly^2) +
            total_rain_monthly + I(total_rain_monthly^2) +
            avg_wind_speed_monthly + I(avg_wind_speed_monthly^2) + 
            wind_direction_monthly + I(wind_direction_monthly^2) +
            avg_pressure_monthly + I(avg_pressure_monthly^2) +
            month,
            model = "within", index = c("station_id"),
            data = did_df)

# Rename variables 
variable_labels_q <- c(
  "POST1" = "Post-Treatment",
  "POST_treat_11" = "Post-Treatment*Ring 1",
  "POST_treat_21" = "Post-Treatment*Ring 2",
  "POST_treat_31" = "Post-Treatment*Ring 3",
  "POST_treat_41" = "Post-Treatment*Ring 4",
  "avg_temperature_monthly" = "Average Monthly Temperature",
  "I(avg_temperature_monthly2)" = "Squared Average Monthly Temperature",
  "total_rain_monthly" = "Total Monthly Rain",
  "I(total_rain_monthly2)" = "Squared Total Monthly Rain",
  "avg_wind_speed_monthly" = "Average Monthly Wind Speed",
  "I(avg_wind_speed_monthly2)" = "Squared Average Monthly Wind Speed",
  "wind_direction_monthly" = "Average Monthly Wind Direction",
  "I(wind_direction_monthly2)" = "Squared Average Monthly Wind Direction",
  "avg_pressure_monthly" = "Average Monthly Pressure",
  "I(avg_pressure_monthly2)" = "Squared Average Monthly Pressure")

# Calculate clustered standard errors to export results using `stargazer()`
se_m5_quad <- se_clustered(m5_quad)

# Export results 
stargazer(m5_quad, 
          se = list(se_m5_quad),
          title = "Regression Results",
          align = TRUE,
          type = "text",
          dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
          dep.var.labels.include = FALSE,
          column.labels = c("Model 5 (Quadratic Specification)"),
          omit = exclude_vars,
          covariate.labels = variable_labels_q)
## 
## Regression Results
## ============================================================================================
##                                        Dependent Variable: Average Monthly NO2 Concentration
##                                        -----------------------------------------------------
##                                                  Model 5 (Quadratic Specification)          
## --------------------------------------------------------------------------------------------
## Post-Treatment                                               -5.254***                      
##                                                               (1.193)                       
##                                                                                             
## Post-Treatment*Ring 1                                        -4.942***                      
##                                                               (1.894)                       
##                                                                                             
## Post-Treatment*Ring 2                                         -1.089                        
##                                                               (1.315)                       
##                                                                                             
## Post-Treatment*Ring 3                                         -0.497                        
##                                                               (1.435)                       
##                                                                                             
## Post-Treatment*Ring 4                                          0.633                        
##                                                               (1.933)                       
##                                                                                             
## Average Monthly Temperature                                    0.248                        
##                                                               (0.306)                       
##                                                                                             
## Squared Average Monthly Temperature                           -0.013*                       
##                                                               (0.008)                       
##                                                                                             
## Total Monthly Rain                                           -0.079***                      
##                                                               (0.006)                       
##                                                                                             
## Squared Total Monthly Rain                                   0.0002***                      
##                                                              (0.00002)                      
##                                                                                             
## Average Monthly Wind Speed                                   -9.985***                      
##                                                               (1.155)                       
##                                                                                             
## Squared Average Monthly Wind Speed                           0.876***                       
##                                                               (0.190)                       
##                                                                                             
## Average Monthly Wind Direction                                 0.009                        
##                                                               (0.090)                       
##                                                                                             
## Squared Average Monthly Wind Direction                        -0.001                        
##                                                               (0.001)                       
##                                                                                             
## Average Monthly Pressure                                      -4.401                        
##                                                               (3.497)                       
##                                                                                             
## Squared Average Monthly Pressure                               0.002                        
##                                                               (0.002)                       
##                                                                                             
## --------------------------------------------------------------------------------------------
## Observations                                                   3,120                        
## R2                                                             0.763                        
## Adjusted R2                                                    0.758                        
## F Statistic                                         377.668*** (df = 26; 3054)              
## ============================================================================================
## Note:                                                            *p<0.1; **p<0.05; ***p<0.01

There are three main insights gathered from the quadratic specification.

First, the coefficients for the meteorological covariates in their linear form diverge from those assigned to the quadratic terms. For instance, the square of total rain is significant and positive (coefficient = 0.0002) while total rain in its linear form has a negative (and also significant) effect on pollution (coefficient = -0.079). Hence, higher rainfall is associated with lower pollution levels but the effect of the former on the later seems to vary as the amount of rain increases. Concerning significance levels, none of the variables that were insignificant in the linear model gain significance in the quadratic one, implying non-linear effects are not that strong.

Second, the estimates for the effect of the policy for the first treatment ring is close but slightly lower in absolute value in the quadratic model. This change could be explained by potential non-linearities now captured by the squared terms or by multicollinearity due to the addition of variables containing the same information as those already in the model.

Third, the adjusted R-squared has barely increased compared to the original model, indicating that the addition of squared terms does not significantly improve the model fit.

Hence, our results are robust to the quadratic specification of the controls as only one quadratic term is significant and our treatment effect estimates are very close to the ones in the linear model.

6.2. Falsification test I: testing the effects of the treatment on the outcome in a ‘placebo’ time period

Let’s falsify our results. We start by testing the significance of our treatment effect (within the 2.5-kilometre range from Plaza del Carmen since we found no further spillover effects) in years prior to the implementation of Madrid Central. This will allow us to assess whether the estimated treatment effect we found is driven by the treatment itself or rather by other factors that may be present in the data. In other words, for our treatment effect to be robust, we would like to find no statistical significance in the coefficient of our ‘POST_treat_1’ variable when we specify the treatment took place in the arbitrarily chosen date of November 2015.

We reproduce the steps used above, but this time with data from 2012 to June of 2018 to avoid any anticipatory effects going up to November of 2018 (date of actual implementation of the policy). We also only use the Model 2 specification from Section 5 as (again) it is only for the first treatment ring that we found statistical significance in the effect of the policy.

Pollution data for pollution monitoring stations inside the City of Madrid (2012 - 2018)

We load pollution hourly data for the city of Madrid for the years 2012 to 2018.

# `For` loop for hourly data for all months in all years
file_names_r <- c(paste0(c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic"), "_mo12.csv"),
                paste0(c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic"), "_mo13.csv"),
                paste0(c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic"), "_mo14.csv"),
                paste0(c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic"), "_mo15.csv"),
                paste0(c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic"), "_mo16.csv"),
                paste0(c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic"), "_mo17.csv"),
                paste0(c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic"), "_mo18.csv"))

# Create an empty dataframe to store the cleaned data
all_city_data_clean_r <- data.frame()

# Loop through the file names and apply the same transformations to each dataset
for (file_name in file_names_r) {
  
  # Read the data
  city_data_r <- read_delim(file_name, delim=";")
  
  # Apply transformations 
  city_data_clean_r <- city_data_r %>% 
    select(-c(PROVINCIA, PUNTO_MUESTREO)) %>% 
    rename(town_code = MUNICIPIO,
           station_code = ESTACION,
           magnitude = MAGNITUD,
           year = ANO,
           month = MES,
           day = DIA) %>% 
    select(-matches("^V")) %>%
    pivot_longer(cols = starts_with("H"), names_to = "hour", values_to = "pollution_level") %>%
    # Keep only NO2 measures and remove 'magnitude' column
    filter(magnitude == "8") %>% 
    select(-magnitude) %>%
    # Adjust variable types
    mutate(month = as.numeric(month), 
           day = as.numeric(day),
           hour = as.numeric(str_replace(hour, "H", "")),
           # Remove leading zeros from 'code' variables
           town_code = as.factor(str_remove(town_code, "^0+")),
           station_code = as.factor(str_remove(station_code, "^0+")),
           # Replace comma with point for 'pollution_level'
           pollution_level = as.numeric(gsub(",", ".", pollution_level)))
    
  # Append the cleaned data to the 'all_city_data_clean' dataframe
   all_city_data_clean_r <- bind_rows(all_city_data_clean_r, city_data_clean_r)
}

# Aggregate the data to monthly level
city_monthly_r <- all_city_data_clean_r %>% 
  group_by(town_code, station_code, year, month) %>% 
  summarise(pollution_level = mean(pollution_level)) 

Pollution data for pollution monitoring stations outside the City of Madrid (2012 - 2018)

We do the same for pollution stations outside the city of Madrid.

# Create file names vector
file_names_r <- c(paste0("com_", c(2012:2018), ".csv"))

# Create an empty dataframe to store the cleaned data
all_com_data_clean_r <- data.frame()

# Loop through the file names and apply the same transformations to each dataset
for (file_name in file_names_r) {
  
  # Read the data
  com_data_r <- read_delim(file_name, delim=";")
  
  # Apply transformations 
  com_data_clean_r <- com_data_r %>% 
    dplyr::select(-c(provincia, punto_muestreo)) %>% 
    rename(town_code = municipio,
           station_code = estacion,
           magnitude = magnitud,
           year = ano,
           month = mes,
           day = dia) %>% 
    select(-matches("^V")) %>%
    pivot_longer(cols = starts_with("h"), names_to = "hour", values_to = "pollution_level") %>%
    # Keep only NO2 measures and remove 'magnitude' column
    filter(magnitude == "8") %>% 
    select(-magnitude) %>%
    # Adjust variable types, remove leading zeros, replace points with commas and adjust contents for comparability
    mutate(month = as.numeric(month),
           day = as.numeric(day),
           hour = as.numeric(str_replace(hour, "h", "")),
           town_code = as.factor(str_remove(town_code, "^0+")),
           station_code = as.factor(str_remove(station_code, "^0+")),
           pollution_level = as.numeric(gsub(",", ".", pollution_level))) 
   
  # Append the cleaned data to the 'all_com_data_clean' dataframe and drop NA
  all_com_data_clean_r <- bind_rows(all_com_data_clean_r, com_data_clean_r) %>% 
    drop_na(pollution_level) 
}

# Aggregate the data to monthly level
com_monthly_r <- all_com_data_clean_r %>% 
  group_by(town_code, station_code, year, month) %>% 
  summarise(pollution_level = mean(pollution_level)) 

Complete dataset of monthly pollution observations for the air quality network of the Community of Madrid

We merge both sets of data by row. We also merge in-City pollution data with in-City station data and out-of-City pollution data with out-of-City station data. We merge those two datasets as well.

# Bind potential treatment with control stations 
full_data_r <- bind_rows(city_monthly_r, com_monthly_r)

# Join 'city_monthly' with 'city_stations' by 'town_code' and 'station_code'
city_data_r <- city_monthly_r %>%
  left_join(city_stations_clean, by = c("town_code", "station_code"))

# Join 'com_monthly' with 'com_stations' by 'town_code' and 'station_code' 
com_data_r <- com_monthly_r %>%
  left_join(com_stations_clean, by = c("town_code", "station_code")) 

# Combine both dataframes into a single one for the whole Community
full_data_complete_r <- bind_rows(city_data_r, com_data_r) %>% 
  select(-c(station_name, rural_area_type))

# Replace the missing values in 'town_name' by Madrid
full_data_complete_r$town_name <- ifelse(is.na(full_data_complete_r$town_name), "Madrid", full_data_complete_r$town_name) 

Dataset with pollution data and station information necessary for modelling

We take the last dataset and join it with our previously created dataframe with new variables for each station. Just like we did for the estimation of our actual treatment effect (back then with June 2021), we only keep observations up to June 2018. We also remove Ring 6 stations.

# Merge data with monthly pollution averages and station data (including rings and nearby weather stations)
model_pollution_df_r <- full_data_complete_r %>%
  left_join(data_all_stations, by = c("town_code", "station_code", "address", "station_type", 
                                      "longitude", "latitude", "distance_km")) %>% 
  ungroup()

# Remove data from July 2018 onwards 
model_pollution_df_r <- model_pollution_df_r %>% 
  filter(year < 2018 | (year == 2018 & month < 7)) %>% 
  # Remove stations in "Ring 6: Excluded"
  filter(ring != "Ring 6: Excluded")

# Remove any levels not present in the dataframe
model_pollution_df_r$ring <- droplevels(model_pollution_df_r$ring)

Weather data (2012 - 2018)

We load weather data for the new time window and make the appropriate transformations. We impute missing values and group by month just like we did before with the actual treatment period.

set.seed(1234)

# Obtain daily values for all stations in Madrid
weather_full_r <- aemet_daily_clim(station = "all", start = "2012-01-01", end = "2018-06-30", 
                                   verbose = FALSE, return_sf = FALSE) %>% 
  filter(provincia == "MADRID") 

# Make transformations: select, adjust variable types and create new columns
weather_full_final_r <- weather_full_r %>% 
  select(date = fecha,
         weather_station_name = nombre,
         avg_temperature = tmed,
         min_temperature = tmin, 
         max_temperature = tmax,
         total_rain = prec,
         direction_max_wind_speed = dir,
         avg_wind_speed = velmedia,
         min_pressure = presMin,
         max_pressure = presMax) %>% 
  # Replace comma with point
  mutate(total_rain = as.numeric(gsub(",", ".", total_rain)),
         direction_max_wind_speed = as.numeric(direction_max_wind_speed),
         # Obtain 'year' and 'month' columns from 'date' column
         date = ymd(date),
         year = year(date),
         month = month(date)) %>% 
  select(-date) 

# Replace all values of "88" in wind direction with NAs 
weather_full_final_r$direction_max_wind_speed <- ifelse(weather_full_final_r$direction_max_wind_speed == "88", NA,
                                                      weather_full_final_r$direction_max_wind_speed)
# Specify columns to impute
cols_to_impute <- c("avg_temperature", "max_temperature", "min_temperature", 
                    "total_rain",  "direction_max_wind_speed", "avg_wind_speed", 
                    "min_pressure", "max_pressure")

# Impute missing values in the columnsh using PMM
imputed_data <- mice(weather_full_final_r[cols_to_impute], m = 10, method = "pmm")

# Update original dataframe with imputed values
weather_full_final_r[cols_to_impute] <- complete(imputed_data)[, cols_to_impute]

# Group imputed weather variables and compute monthly averages 
weather_full_final_r <- weather_full_final_r %>% 
  group_by(weather_station_name, year, month) %>% 
  summarise(avg_temperature_monthly = round(mean(avg_temperature), 2),
            min_temperature_monthly = round(min(min_temperature), 2),
            max_temperature_monthly = round(max(min_temperature), 2),
            total_rain_monthly = round(sum(total_rain), 2),
            wind_direction_monthly = round(mean(direction_max_wind_speed), 2),
            avg_wind_speed_monthly = round(mean(avg_wind_speed), 2),
            avg_pressure_monthly = round(mean((min_pressure + max_pressure)/2), 2))

Pre-modelling feature engineering

We create our dataset ‘did_df_r’. This includes pollution and weather data which we will use to run our model. We create the dummies that go into our regressions - time, treatment ring (again, only the first) and the interaction between the two - and transform them into factors. We make some more modifications just like we did above.

# Rename weather station variable so that it is the same in both datasets
model_pollution_df_r <- model_pollution_df_r %>% 
  rename(weather_station_name = closest_weather_station)

# Merge pollution and weather data
did_df_r <- model_pollution_df_r %>% 
  left_join(weather_full_final_r, 
            by = c("year", "month", "weather_station_name"))

# Create post-treatment, treatment and interaction terms
did_df_r <- did_df_r %>% mutate(
                                # Create post-treatment dummy (assuming implementation is in November 2015)
                                POST = ifelse(year > 2015 | (year == 2015 & month == 12), 1, 0),
                                # Create treatment dummy for ring 1
                                treatment_ring1 = ifelse(ring == "Ring 1: Treated", 1, 0),
                                # Create interaction terms by multiplying POST and treatment ring
                                POST_treat_1 = POST*as.numeric(treatment_ring1)) 


# Transform new variables into factors
did_df_r <- did_df_r %>%
  mutate(across(any_of(c("POST", "treatment_ring1", "POST_treat_1")), as.factor))

# Create 'station_id' by concatenating 'town_code' and 'station_code', separated by a hyphen
did_df_r$station_id <- as.factor(paste(did_df_r$town_code, did_df_r$station_code, sep="-"))

# Remove variables
did_df_r <- did_df_r %>% select(-c(min_temperature_monthly, max_temperature_monthly, town_code, station_code))

# Convert month variables to strings
did_df_r$month <- as.character(did_df_r$month)

Modelling

We set up the model with the first treatment ring, the weather controls and the month and station fixed effects. We compute clustered standard errors.

# Set up the fixed effects model
m <- plm(pollution_level ~ POST + POST_treat_1 +
              avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
              wind_direction_monthly + avg_pressure_monthly +
              month,
              model = "within", index = "station_id",
              data = did_df_r)

# Rename variables
variable_labels_r <- c(
  "POST1" = "Post-Treatment",
  "POST_treat_11" = "Post-Treatment*Ring 1",
  "avg_temperature_monthly" = "Average Monthly Temperature",
  "total_rain_monthly" = "Total Monthly Rain",
  "avg_wind_speed_monthly" = "Average Monthly Wind Speed",
  "wind_direction_monthly" = "Average Monthly Wind Direction",
  "avg_pressure_monthly" = "Average Monthly Pressure")

# Calculate clustered standard errors to export results using `stargazer()`
se_m <- se_clustered(m)

# Export results 
stargazer(m, 
          se = list(se_m),
          title = "Regression Results",
          align = TRUE,
          type = "text",
          dep.var.caption = "Dependent Variable: Average Monthly NO2 Concentration",
          dep.var.labels.include = FALSE,
          column.labels = c("Model 2 (Placebo Treatment Period)"),
          omit = exclude_vars,
          covariate.labels = variable_labels_r)
## 
## Regression Results
## ====================================================================================
##                                Dependent Variable: Average Monthly NO2 Concentration
##                                -----------------------------------------------------
##                                         Model 2 (Placebo Treatment Period)          
## ------------------------------------------------------------------------------------
## Post-Treatment                                       2.141***                       
##                                                       (0.380)                       
##                                                                                     
## Post-Treatment*Ring 1                                  1.768                        
##                                                       (1.330)                       
##                                                                                     
## Average Monthly Temperature                            0.118                        
##                                                       (0.077)                       
##                                                                                     
## Total Monthly Rain                                   -0.050***                      
##                                                       (0.008)                       
##                                                                                     
## Average Monthly Wind Speed                           -4.207***                      
##                                                       (0.468)                       
##                                                                                     
## Average Monthly Wind Direction                       -0.078***                      
##                                                       (0.029)                       
##                                                                                     
## Average Monthly Pressure                              -0.013                        
##                                                       (0.039)                       
##                                                                                     
## ------------------------------------------------------------------------------------
## Observations                                           3,100                        
## R2                                                     0.746                        
## Adjusted R2                                            0.741                        
## F Statistic                                 495.926*** (df = 18; 3042)              
## ====================================================================================
## Note:                                                    *p<0.1; **p<0.05; ***p<0.01

Clearly, the interaction term lacks statistical significance, meaning the “fake treatment” we arbitrarily specified as taking place in November 2015 had no important effects on pollution level in the four stations closest to Plaza del Carmen. This contrasts with the “actual treatment” happening in November 2018, which did have a significant effect on pollution for the same four stations.

This suggests that the estimated treatment effect during the actual treatment period is more likely to be driven by the treatment itself rather than by other factors, which provides some evidence for the robustness of the results.

6.3. Falsification test II: testing the effects of the treatment on a ‘placebo’ outcome

The second placebo test consists of reproducing our analysis above (same criteria for assigning stations into treatment and same period of intervention) but using an outcome that, in theory, should not be affected (or affected but to a much lower degree) by the treatment. Due to the availability of open data, we select noise pollution.

It is true that a potential association between the introduction of an LEZ and average noise levels cannot be completely discarded. Going back to our DAG, Madrid Central could have an effect on noise levels through the traffic redirection mechanism (i.e. fewer cars in the immediacy of Madrid Central, lower average noise levels). However, this effect is likely to be considerably smaller than the one on pollution levels due to (for instance) the neutralization of the fleet recomposition channel in this new identification strategy.

Noise pollution data inside the City of Madrid (2015 - 2021)

Hence, we download the data from the City Council’s Data Portal. Monthly data is available since 1998 only for the City of Madrid, which is enough for our purposes as we only found significant effects on pollution in the immediacy of the policy centroid.

We want to use data from 2015 to 2021 as we did for the main regression on pollution levels to ensure comparability. Yet, data is only available in CSV format since 2017 so we create a loop to read data for the last five years in our sample. We select the following variables: date, station code, station name and the noise level. Among all available indicators, we select “LAeq 24”, which captures the average long-term noise level in the area during all 24-hour periods of each month.

We perform some data cleaning and remove missing values due to the fact that they are few and evenly distributed through time and space.

# Create file names vector
file_names <- c(paste0("Anio", c(2017:2021), "_acus.csv"))

# Create an empty dataframe to store the cleaned data
all_acus_data_clean <- data.frame()

# Loop through the file names and apply the same transformations to each dataset
for (file_name in file_names) {
  
  # Read the data 
  acus_data <- read_delim(file_name, delim = ";")
  
  # Apply transformations 
  acus_data_clean <- acus_data %>% 
    select(date = Fecha,
           station_code = NMT,
           station_name = Nombre,
           avg_24_hours = LAeq24) %>% 
    # Adjust variable types, remove leading zeros, replace points with commas and adjust contents for comparability
    mutate(station_code = as.factor(station_code),
           avg_24_hours = ifelse(avg_24_hours == "N/D", NA, avg_24_hours),
           avg_24_hours = as.numeric(gsub(",", ".", avg_24_hours))) 
   
  # Append the cleaned data to the 'all_acus_data_clean' dataframe 
  all_acus_data_clean <- bind_rows(all_acus_data_clean, acus_data_clean) 
}

# Create a vector of month abbreviations in the correct order
month_abbreviations <- c("ene", "feb", "mar", "abr", "may", "jun", "jul", "ago", "sep", "oct", "nov", "dic")

# Additional transformations
all_acus_data_clean <- all_acus_data_clean %>%
  # Split the date column into month and year columns
  mutate(month = as.numeric(match(tolower(substr(date, 1, 3)), tolower(month_abbreviations))),
         year = as.numeric(paste0("20", substr(date, 5, 6)))) %>% 
  # Remove the original date column
  select(-date) %>% 
  # Remove missing values
  drop_na(avg_24_hours)

For years 2015 and 2016, data is only available in XLS format (with data for each month stored in separate sheets of the same workbook). We hence make use of Excel VBA to create a CSV file out of each of these sheets and set up another loop to read them. We start with 2016 data and apply similar transformations as we did above.

# Create a vector of month abbreviations in the correct order
months <- c("Enero", "Febrero", "Marzo", "Abril", "Mayo", "Junio", 
                         "Julio", "Agosto", "Septiembre", "Octubre", "Noviembre", "Diciembre")

# Create an empty dataframe to store the cleaned data
all_acus_data_clean_2016 <- data.frame()

  # Loop through the months from January to December
for (month in months) {
    # Construct the file name
    file_name <- paste0("2016_", month, "_acus.csv")
    
    # Read the data 
    acus_data_2016 <- read_csv(file_name)
    
    # Apply transformations 
    acus_data_clean_2016 <- acus_data_2016 %>% 
      select(station_code = NMT,
             station_name = Nombre,
             avg_24_hours = LAeq24) %>% 
      # Adjust variable types, remove leading zeros, replace points with commas and adjust contents for comparability
      mutate(station_code = as.factor(station_code),
             avg_24_hours = ifelse(avg_24_hours == "N/D", NA, avg_24_hours),
             avg_24_hours = as.numeric(gsub(",", ".", avg_24_hours)),
             # Add year and month columns
             year = as.numeric(2016),
             month = as.numeric(match(tolower(month), tolower(months))))
    
    # Append the cleaned data to the 'all_acus_data_clean_2016' dataframe 
    all_acus_data_clean_2016 <- bind_rows(all_acus_data_clean_2016, acus_data_clean_2016) 
}

Unfortunately, there is no available data for the months of June and July of 2015, forcing us to continue our analysis while acknowledging the non-random missigness of data during this period. We load the data and transform it:

# Create a vector of month abbreviations in the correct order
months <- c("Enero", "Febrero", "Marzo", "Abril", "Mayo", "Agosto", 
            "Septiembre", "Octubre", "Noviembre", "Diciembre")

# Create an empty dataframe to store the cleaned data
all_acus_data_clean_2015 <- data.frame()

  # Loop through the months from January to December, excluding June and July
for (month in months) {
    # Construct the file name
    file_name <- paste0("2015_", month, "_acus.csv")
    
    # Read the data 
    acus_data_2015 <- read_csv(file_name)
    
    # Apply transformations 
    acus_data_clean_2015 <- acus_data_2015 %>% 
      select(station_code = NMT,
             station_name = Nombre,
             avg_24_hours = LAeq24) %>% 
      # Adjust variable types, remove leading zeros, replace points with commas and adjust contents for comparability
      mutate(station_code = as.factor(station_code),
             avg_24_hours = ifelse(avg_24_hours == "N/D", NA, avg_24_hours),
             avg_24_hours = as.numeric(gsub(",", ".", avg_24_hours)),
             year = as.numeric(2015),
             month = as.numeric(match(tolower(month), tolower(months))))
    
    # Append the cleaned data to the 'all_acus_data_clean_2015' dataframe 
    all_acus_data_clean_2015 <- bind_rows(all_acus_data_clean_2015, acus_data_clean_2015) 
}

We merge all three datasets into one with monthly observations from 2015 to 2021.

# Bind together observations of all years  
full_data_acus <- bind_rows(all_acus_data_clean, all_acus_data_clean_2016, all_acus_data_clean_2015)

Data for noise pollution monitoring stations inside the City of Madrid

Just like we did with the pollution monitoring station network in Madrid, we download information about the location (address and coordinates) and unique identifiers of these noise quality stations. There are 31 in total, some of which overlap in location with pollution measuring stations (for instance the policy centroid, i.e. Plaza del Carmen) and some which do not.

# Read the data
acus_stations <- read_delim("EstacionesMedidaControlAcustico.csv", delim =";")

# Select relevant columns and extract number of station code  
acus_stations_clean <- acus_stations %>% 
  select(station_code = NMT,
         station_name = NOMBRE,
         address = DIRECCION,
         longitude = LONGITUD,
         latitude = LATITUD) %>% 
  mutate(station_code = as.factor(gsub("\\D", "", station_code)),
         longitude = as.numeric(longitude),
         latitude = as.numeric(latitude))

Once again, we extract the coordinates of the Plaza del Carmen station and compute the distance of all 31 noise monitoring stations from it.

This time we create three different rings (as we are only considering stations within the City which do not extend past 15 km of the centroid), while preserving the range for the first three rings (i.e. both pollution and noise stations within 2.5 km are placed into Ring 1, those between 2.5 and 8 km into Ring 2, etc.). Ring 1 now contains 6 stations, compared to the 4 it had in the ring specification of the primary pollution regression.

# Extract latitude and longitude of Plaza del Carmen 
plaza_carmen_longitude <- acus_stations_clean$longitude[acus_stations_clean$station_name == "Plaza del Carmen"]
plaza_carmen_latitude <- acus_stations_clean$latitude[acus_stations_clean$station_name == "Plaza del Carmen"]

# Calculate the distance between each station and Plaza del Carmen
acus_stations_clean <- acus_stations_clean %>%
  mutate(distance_km = round(distHaversine(cbind(longitude, latitude), 
                                           c(plaza_carmen_longitude, plaza_carmen_latitude)) / 1000, 2)) %>% 
  arrange(distance_km)

# Specify cutoffs for 'distance_km' and assign ring labels
acus_stations_clean$ring <- cut(acus_stations_clean$distance_km, breaks = c(-Inf, 2.5, 8, Inf),
                     labels = c("Ring 1: Treated", 
                                "Ring 2: Treated", 
                                "Ring 3: Control"))

# Display number of stations in each ring
acus_stations_clean %>% group_by(ring) %>% count()
## # A tibble: 3 × 2
## # Groups:   ring [3]
##   ring                n
##   <fct>           <int>
## 1 Ring 1: Treated     6
## 2 Ring 2: Treated    15
## 3 Ring 3: Control    10

Data for weather stations in the Community of Madrid

We reproduce the strategy we used to match each pollution station with the closest weather station, only now we use the coordinates of the stations in the noise monitoring network of the City Council.

In theory meteorological conditions should not have a strong effect on noise levels. However, to ensure the comparability of our primary and placebo regressions, we will include weather covariates in the following model.

# Access data on weather stations and make transformations
weather_stations_r <- aemet_stations(verbose = FALSE, return_sf = FALSE) %>% 
  filter(provincia == "MADRID") %>% 
  select(-c(indicativo, indsinop, provincia)) %>% 
  rename(altitude = altitud,
         longitude = longitud,
         latitude = latitud,
         weather_station_name = nombre)

# Create a matrix of coordinates for noise stations
coords_noise <- acus_stations_clean[, c("longitude", "latitude")]

# Create a matrix of coordinates for weather stations
coords_weather_r <- weather_stations_r[, c("longitude", "latitude")]

# Calculate distances between each noise station and each weather station
distances <- distm(coords_noise, coords_weather_r, fun = distHaversine)

# Find the index of the closest weather station for each noise station
closest_weather_index <- apply(distances, 1, which.min)

# Match each noise station to the closest weather station
acus_stations_clean$closest_weather_station <- weather_stations_r$weather_station_name[closest_weather_index]

Pre-modelling data aggregation and feature engineering

Now that we have all the data that we need, we merge noise level averages with the information about each station (including rings and closest weather stations) and remove observations after July 2021 to ensure our results are comparable to those of our main analysis.

# Merge data with monthly noise pollution averages and station data 
model_pollution_acus <- full_data_acus %>%
  left_join(acus_stations_clean, by = c("station_code", "station_name"))

# Remove data from July 2021 onwards 
model_pollution_acus <- model_pollution_acus %>% 
  filter(year < 2021 | (year == 2021 & month < 7)) 

Next, we merge this new dataframe with the weather information.

# Rename weather station variable so that it is the same in both datasets
model_pollution_acus <- model_pollution_acus %>% 
  rename(weather_station_name = closest_weather_station)

# Merge pollution and weather data
did_df_acus <- model_pollution_acus %>% 
  left_join(weather_full_final_a, 
            by = c("year", "month", "weather_station_name"))

Just like we have done before, we create the post-treatment, treatment (only for Ring 1 since that is where we found significant effects) and the interaction term. We adjust variable types and discard irrelevant variables.

# Create post-treatment, treatment and interaction terms
did_df_acus <- did_df_acus %>% mutate(
                                # Create post-treatment dummy (assuming implementation is in November 2018)
                                POST = ifelse(year > 2018 | (year == 2018 & month == 12), 1, 0),
                                # Create treatment dummy for ring 1
                                treatment_ring1 = ifelse(ring == "Ring 1: Treated", 1, 0),
                                # Create interaction terms by multiplying POST and treatment ring
                                POST_treat_1 = POST*as.numeric(treatment_ring1)) 

# Transform new variables into factors
did_df_acus <- did_df_acus %>%
  mutate(across(any_of(c("POST", "treatment_ring1", "POST_treat_1")), as.factor))

# Remove variables
did_df_acus <- did_df_acus %>% select(-c(min_temperature_monthly, max_temperature_monthly, address))

# Convert month variables to strings
did_df_acus$month <- as.character(did_df_acus$month)

Modelling

Finally, we run the model of average monthly noise levels on the treatment dummy (for Plaza del Carmen and the five closest stations) and the weather covariates. We compute clustered standard errors.

# Set up the fixed effects model
m_noise <- plm(avg_24_hours ~ POST + POST_treat_1 +
               avg_temperature_monthly + total_rain_monthly + avg_wind_speed_monthly + 
               wind_direction_monthly + avg_pressure_monthly +
               month,
               model = "within", index = "station_code",
               data = did_df_acus)

# T-test with clustered standard errors
clustered_test_m_noise <- coeftest(m_noise, vcov = vcovHC(m_noise, type = "HC1", cluster = "group"))
print(clustered_test_m_noise)
## 
## t test of coefficients:
## 
##                           Estimate Std. Error  t value  Pr(>|t|)    
## POST1                   -230.68199    5.41079 -42.6337 < 2.2e-16 ***
## POST_treat_11             -0.13983   10.85402  -0.0129  0.989722    
## avg_temperature_monthly   -2.55793    1.12080  -2.2822  0.022567 *  
## total_rain_monthly         0.26553    0.14242   1.8644  0.062396 .  
## avg_wind_speed_monthly  -103.09887   13.74835  -7.4990 9.149e-14 ***
## wind_direction_monthly     3.84028    0.92450   4.1539 3.389e-05 ***
## avg_pressure_monthly      -2.72291    0.61979  -4.3933 1.168e-05 ***
## month10                    8.92238   16.96038   0.5261  0.598889    
## month11                   66.67095   15.31216   4.3541 1.395e-05 ***
## month12                   83.63105   14.92348   5.6040 2.347e-08 ***
## month2                    44.12627   10.45118   4.2221 2.515e-05 ***
## month3                    74.98101   13.11653   5.7165 1.229e-08 ***
## month4                    48.69441   14.96806   3.2532  0.001158 ** 
## month5                    77.66473   18.19179   4.2692 2.042e-05 ***
## month6                    98.28343   20.33526   4.8332 1.433e-06 ***
## month7                    96.64916   29.12647   3.3183  0.000920 ***
## month8                    86.78405   22.89251   3.7909  0.000154 ***
## month9                    69.11585   21.91805   3.1534  0.001635 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Calculate clustered standard errors to export results using `stargazer()`
se_m_noise <- se_clustered(m_noise)

# Export results
stargazer(m_noise, 
          se = list(se_m_noise),
          title = "Regression Results",
          align = TRUE,
          type = "text",
          dep.var.caption = "Dependent Variable: Average Monthly Noise Level",
          dep.var.labels.include = FALSE,
          column.labels = c("Model 2 (Placebo Outcome)"),
          omit = exclude_vars,
          covariate.labels = variable_labels_r)
## 
## Regression Results
## ==============================================================================
##                                Dependent Variable: Average Monthly Noise Level
##                                -----------------------------------------------
##                                           Model 2 (Placebo Outcome)           
## ------------------------------------------------------------------------------
## Post-Treatment                                   -230.682***                  
##                                                    (5.411)                    
##                                                                               
## Post-Treatment*Ring 1                              -0.140                     
##                                                   (10.854)                    
##                                                                               
## Average Monthly Temperature                       -2.558**                    
##                                                    (1.121)                    
##                                                                               
## Total Monthly Rain                                 0.266*                     
##                                                    (0.142)                    
##                                                                               
## Average Monthly Wind Speed                       -103.099***                  
##                                                   (13.748)                    
##                                                                               
## Average Monthly Wind Direction                    3.840***                    
##                                                    (0.924)                    
##                                                                               
## Average Monthly Pressure                          -2.723***                   
##                                                    (0.620)                    
##                                                                               
## ------------------------------------------------------------------------------
## Observations                                        2,333                     
## R2                                                  0.264                     
## Adjusted R2                                         0.249                     
## F Statistic                               45.605*** (df = 18; 2284)           
## ==============================================================================
## Note:                                              *p<0.1; **p<0.05; ***p<0.01

Luckily, with a p-value of 0.99, the effect of the intervention is found to have insignificant effects on the average noise level in Ring 1 stations. This increases the plausibility of observed effects on air pollution being truly attributable to the LEZ. The validity of our research design is thereby strengthened.

7. APPENDIX

This section includes visualizations and tables belonging to the Appendix in the main document.

7.1. Weighted imputation of missing values in the weather controls

As we saw in Section 4.3, the assumption required for multiple imputation (MAR) might not be satisfied as missingness appears to be associated with the station. With the hope of improving the reliability of the imputation procedure, we incorporate weights following the approach in Gomila and Clark (2020). Larger weights are assigned to observations that have higher probabilities of going missing and smaller weights are assigned to observations with lower chances of missingness (Inverse Probability Weighting, or IPM).

We create a separate dataframe to impute missing values using weights and perform the same transformations as before:

# Repeat loading and transforming of weather data
weather_full_final_imp <- weather_full %>% 
  select(date = fecha,
         weather_station_name = nombre,
         avg_temperature = tmed,
         min_temperature = tmin, 
         max_temperature = tmax,
         total_rain = prec,
         direction_max_wind_speed = dir,
         avg_wind_speed = velmedia,
         min_pressure = presMin,
         max_pressure = presMax) %>% 
  # Replace comma with point
  mutate(total_rain = as.numeric(gsub(",", ".", total_rain)),
         direction_max_wind_speed = as.numeric(direction_max_wind_speed),
         # Obtain 'year' and 'month' columns from 'date' column
         date = ymd(date),
         year = year(date),
         month = month(date)) %>% 
  select(-date) 

# Replace all values of "88" in wind direction with NAs 
weather_full_final_imp$direction_max_wind_speed <- 
  ifelse(weather_full_final_imp$direction_max_wind_speed == "88", NA,                                                      weather_full_final_imp$direction_max_wind_speed)

Again, we take the example of minimum pressure and create a variable that takes a value of 1 for available observations and 0 for missing values. We then regress this new variable on the column capturing the weather station using logistic regression, after which we generate probabilities for missingness and the corresponding weights (i.e. inverse of probabilities).

We then perform imputation by Weighted Predictive Mean Matching (PMM) (or Weighted Normal Linear Regression). We incorporate the imputed values into the original dataframe, ‘weather_full_final_imp’.

set.seed(1234)

# Example for minimum pressure
weather_full_final_imp <- weather_full_final_imp %>% 
  mutate(missing = as.numeric(!is.na(weather_full_final_imp$min_pressure)))

# Model of missingness of minimum pressure conditional on the weather station
m_missing <- glm(missing ~ weather_station_name,
                  family = binomial(link = "logit"),
                  data = weather_full_final_imp)

# Calculate the predicted probabilities of missingness
prob_missing <- m_missing$fitted

# Calculate the inverse of the predicted probabilities (weights)
gen_weights <- 1/prob_missing

# Perform multiple imputations using weighted PMM
imputed_data <- mice::mice(weather_full_final_imp, 
                           method = "weighted.pmm",
                           imputationWeights = gen_weights)

# Update original dataframe with imputed values
weather_full_final_imp$min_pressure <- mice::complete(imputed_data)[, "min_pressure"]

Before we look at the imputed distribution, we confirm that it is actually different from the distribution of the imputed variable using normal or unweighted PMM:

# Check the imputed distributions from PMM and PMM with weights are different
are_identical <- identical(weather_full_final$min_pressure, weather_full_final_imp$min_pressure)
print(are_identical)
## [1] FALSE

Below we see that the five-number summary and distribution of minimum pressure after imputing by PMM with weights is very similar to both the original and the unweighted PMM-imputed one. Hence, we retain the PMM imputation not involving weights for all our weather variables in the model.

# Print 5-number summary for 'min_pressure'
summary(weather_full_final_imp$min_pressure)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   780.4   892.4   931.5   909.2   943.1   971.9
# Create histogram for the distribution of 'min_pressure' using weights
ggplot(weather_full_final_imp, aes(x = min_pressure)) +
  geom_histogram(binwidth = 4, color = "black", fill = "lightpink", position ="identity") +
  labs(x = "Values", 
       y = "Frequency",
       title = "PMM-Imputed Distribution of Minimum Pressure with Weights") +
  theme_light() +
  theme(plot.title = element_text(size = 12)) 

7.2. Access rules according to vehicle pollution potential

In order to aid the user in the understanding of the rules applying to cars entering Madrid Central according to their polluting potential, we create the following table:

# Define variable names 
labels <- c("No label", "B", "C", "ECO", "ZERO")

# Define  descriptions
descriptions <- c("Unclassified gasoline and diesel vehicles", 
                  "Gasoline vehicles registered after 2000 and diesel vehicles registered after 2006",
                  "Gasoline vehicles registered after 2006 and diesel vehicles registered after 2014",
                  "Plug-in hybrid vehicles with a range below 40 km and gas vehicles",
                  "Electric (100% and range-extended) and (plug-in) hybrid vehicles with a range over 40 km")

# Define access rules
access_rules <- c("Never", 
                  "7 – 13 hours (off-street parking spaces)", 
                  "7 – 17 hours (off-street parking spaces)", 
                  "7 – 24 hours (on-street parking spaces)", 
                  "24 hours")

# Create dataframe with variables names, descriptions and access rules
MC_description_df <- data.frame(labels, descriptions, access_rules)

# Convert 'labels' to a factor variable with the desired order of levels
MC_description_df$labels <- factor(MC_description_df$labels, levels = labels, ordered = TRUE)

# Rename columns in the final dataframe
names(MC_description_df) <- c("Label", "Description", "Access to Madrid Central LEZ*")

# Format table 
kable(MC_description_df, format = "html", row.names = FALSE) %>%
  kable_styling(c("striped", "hover", "bordered"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) 
Label Description Access to Madrid Central LEZ*
No label Unclassified gasoline and diesel vehicles Never
B Gasoline vehicles registered after 2000 and diesel vehicles registered after 2006 7 – 13 hours (off-street parking spaces)
C Gasoline vehicles registered after 2006 and diesel vehicles registered after 2014 7 – 17 hours (off-street parking spaces)
ECO Plug-in hybrid vehicles with a range below 40 km and gas vehicles 7 – 24 hours (on-street parking spaces)
ZERO Electric (100% and range-extended) and (plug-in) hybrid vehicles with a range over 40 km 24 hours

7.3. Timeline of the implementation of Madrid Central

Similarly, we create a timeline guiding the user through the different phases of the implementation of Madrid Central for the time window under consideration using the vistime()function.

# Create dataframe with information about event (start and end date)
timeline_data <- data.frame(event = c("Transition", 
                                      "Sanctions", 
                                      "",
                                      "Application with sanctions (slight rule change)"), 
                            start = c("2018-11-30", 
                                      "2019-03-16",
                                      "2019-07-01",
                                      "2019-07-09"), 
                            end = c("2019-03-15",
                                    "2019-06-30",
                                    "2019-07-08",
                                    "2021-06-30"))

# Create static timeline with labels
vistime(timeline_data, optimize_y = TRUE, title = "Timeline of the Implementation of Madrid Central")

We also create an interactive version using hc_vistime() from the highcharter package with slighly more detailed information compared to the static timeline:

# Create dataframe with information about event (start and end date)
timeline_data <- data.frame(event = c("Transitionary Period I: no cameras", 
                                      "Transitionary Period II: monitoring system without sanctions", 
                                      "Application with sanctions", 
                                      "Suspension",
                                      "Application with sanctions: slight rule change"), 
                            start = c("2018-11-30", 
                                      "2019-01-01", 
                                      "2019-03-16",
                                      "2019-07-01",
                                      "2019-07-09"), 
                            end = c("2019-01-01", 
                                    "2019-03-15",
                                    "2019-07-01",
                                    "2019-07-08",
                                    "2021-06-30"))

# Create interactive timeline without labels
hc_vistime(timeline_data, show_labels = FALSE, title = "Timeline of the Implementation of Madrid Central")

Finally, we create a table with a brief explanation of the changes taking place during the dates included in the timelines.

# Define dates
dates <- c("30th November, 2018",
           "1st January, 2019", 
           "15th March, 2019", 
           "1st - 7th July, 2019",
           "1st January, 2020",
           "July, 2021")

# Define events
event <- c("Official day of entry into force of Madrid Central", 
           "Activation of camera monitoring system",
           "Beginning of sanctions",
           "Temporary suspension of LEZ sanctions",
           "Introduction of slightly modified rules",
           "Approval of 'Madrid 360'")

# Define event descriptions 
descriptions <- c("First phase of the transition period. No active cameras, with police officers monitoring and enforcing rules.", 
"Automatic monitoring system with cameras at the LEZ perimeter. Warnings for non-compliance sent by post.",
"Infractions first subject to fines ranging up to 90 euros, or 45 euros with early payment.",
"Suspension of sanctions under the new mayor of Madrid, followed by reinstatement by Court order.",
"Parking ban on no-label vehicles and two streets released from restrictions (Mártires de Alcalá and Seminario de Nobles).",
"'Madrid 360' replaces Madrid Central. Changes include the replacement of Madrid Central by the 'Zona de Bajas Emisiones de Especial Protección Distrito Centro', a new LEZ in Plaza Elíptica and the provision of banning entrance to no-label vehicles into the LEZ starting in January 2022.")

# Create dataframe with dates, events and event descriptions
timeline_df <- data.frame(dates, event, descriptions)

# Convert 'dates' to a factor variable with the desired order of levels
timeline_df$dates <- factor(timeline_df$dates, levels = dates, ordered = TRUE)

# Rename columns in the final dataframe
names(timeline_df) <- c("Date", "Event", "Event Description")

# Format table 
kable(timeline_df, format = "html", row.names = FALSE) %>%
  kable_styling(c("striped", "hover", "bordered"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) 
Date Event Event Description
30th November, 2018 Official day of entry into force of Madrid Central First phase of the transition period. No active cameras, with police officers monitoring and enforcing rules.
1st January, 2019 Activation of camera monitoring system Automatic monitoring system with cameras at the LEZ perimeter. Warnings for non-compliance sent by post.
15th March, 2019 Beginning of sanctions Infractions first subject to fines ranging up to 90 euros, or 45 euros with early payment.
1st - 7th July, 2019 Temporary suspension of LEZ sanctions Suspension of sanctions under the new mayor of Madrid, followed by reinstatement by Court order.
1st January, 2020 Introduction of slightly modified rules Parking ban on no-label vehicles and two streets released from restrictions (Mártires de Alcalá and Seminario de Nobles).
July, 2021 Approval of ‘Madrid 360’ ‘Madrid 360’ replaces Madrid Central. Changes include the replacement of Madrid Central by the ‘Zona de Bajas Emisiones de Especial Protección Distrito Centro’, a new LEZ in Plaza Elíptica and the provision of banning entrance to no-label vehicles into the LEZ starting in January 2022.

REFERENCES

Gomila, R., and C. S. Clark. 2020. “Missing Data in Experiments: Challenges and Solutions”. American Psychological Association, 27(2): 143–155.

Pearl, J. 2009. Causality. 2nd ed. Cambridge, MA: Cambridge University Press.

Rubin, D. B. 1976. “Inference and Missing Data”. Biometrika, 63(3): 581–592.

Snow, J. 1855. On the Mode of Communication of Cholera. John Churchill.